Cartesian Skeptic: debugging

I tend to call any troubleshooting process "debugging." A week or so ago I debugged our lawnmower when it wouldn't start. Today I spent a lot of time (unsuccessfully) debugging a mysterious Apache child process crash related to a PHP application. Tomorrow I will undoubtedly debug some software at work and some other things in "real life."

I would guess that roughly 40% of my life as a professional programmer and system administrator has been spent debugging. By that I don't mean to imply that I only write broken code (although I probably do); what I mean is that, just as most writing is rewriting, most programming is debugging. A first attempt at solving a problem rarely works correctly. If it were that easy to solve, it must not have been much of a problem in the first place, right?

For me, debugging software/hardware/whatever is mostly about thinking of the most common things that could be causing the malfunction and then systematically proving that each of them is or is not the cause of the current problem. When I was a kid and my dad taught me how to keep our crusty old lawnmower running, he gave me the checklist "fuel, fire, air." This magic formula has pointed to the cause of 90% of the problems I have ever had with engines.

Does it have gas in the tank?
Is the gas getting to the cylinders?
Is the battery/magneto/alternator working?
Is the charge making it past the points/distributor?
Is the spark plug fouled?
Is the air filter clogged?
Is the choke stuck closed?

The order here is important too. On each branch of the decision tree we start at one end of the subsystem and walk towards the other end, checking the key weaknesses in between. The branches are also arranged in order of most frequent failure. All this ordering hopefully gets you to the root problem earlier rather than later. You need to tailor the tree based on prior experience. When working with the beater Datsun I drove in college, fire came before fuel because it had constant alternator and distributor problems.
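The "fuel, fire, air" walk can be sketched in a few lines of Python. This is only an illustration: the `mower` observations, the `observed` helper and the check labels are made up for the example, and each check returns True when that part is healthy.

```python
from typing import Callable, List, Optional, Tuple

# A check is (label, function); the function returns True when that part is healthy.
Check = Tuple[str, Callable[[], bool]]
Branch = Tuple[str, List[Check]]


def first_failure(branches: List[Branch]) -> Optional[Tuple[str, str]]:
    """Walk the branches in order and return the first (branch, check) that fails."""
    for branch_name, checks in branches:
        for label, is_healthy in checks:
            if not is_healthy():
                return branch_name, label
    return None  # every check passed; the checklist needs another branch


# Hypothetical observations for a mower that will not start.
mower = {
    "gas_in_tank": True,
    "gas_reaching_cylinder": True,
    "ignition_source_ok": True,
    "charge_past_distributor": True,
    "spark_plug_clean": False,
    "air_filter_clear": True,
    "choke_free": True,
}


def observed(key: str) -> Callable[[], bool]:
    """Stand-in for a real inspection: look the answer up in the dict above."""
    return lambda: mower[key]


# Branches in order of most frequent failure; checks walk each subsystem end to end.
branches: List[Branch] = [
    ("fuel", [("gas in tank", observed("gas_in_tank")),
              ("gas reaching cylinder", observed("gas_reaching_cylinder"))]),
    ("fire", [("battery/magneto/alternator working", observed("ignition_source_ok")),
              ("charge past points/distributor", observed("charge_past_distributor")),
              ("spark plug clean", observed("spark_plug_clean"))]),
    ("air", [("air filter clear", observed("air_filter_clear")),
             ("choke free", observed("choke_free"))]),
]

print(first_failure(branches))  # ('fire', 'spark plug clean')
```

For the Datsun, the only change would be moving the "fire" branch to the front of the list.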

Software debugging works just like the mechanical troubleshooting process for a malfunctioning engine. The only difference is the decision tree. If I'm working on a network connectivity problem the quick checklist usually goes something like "interface, firewall, routing."

Is the physical link in place?
Is the interface active?
Is there a local firewall block?
Is there an upstream firewall block?
Is there an outbound route?
Is there an outbound route from the next hop?

Just as with the lawnmower and the car, this list gets modified based on what I'm working on. This order worked best for border connectivity issues on the frame relay networks I used to work with. The network I'm on today calls for checking the firewall first because 50% of the time that's where the problem is.
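One rough way to tailor the order to a particular environment is to count where past problems actually turned out to be. The sketch below assumes you keep some record of past root causes; the `history` list and the `order_by_history` function are invented for illustration.

```python
from collections import Counter
from typing import List


def order_by_history(branches: List[str], past_root_causes: List[str]) -> List[str]:
    """Put the branches that have failed most often first."""
    counts = Counter(past_root_causes)
    # sorted() is stable, so branches with equal counts keep their original order.
    return sorted(branches, key=lambda branch: -counts[branch])


default_order = ["interface", "firewall", "routing"]

# Hypothetical record of where the last six connectivity problems were found.
history = ["firewall", "routing", "firewall", "interface", "firewall", "routing"]

print(order_by_history(default_order, history))
# ['firewall', 'routing', 'interface']
```

With no history at all the default order comes back unchanged, which matches starting from a generic checklist and adjusting it as experience accumulates.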

So how do you construct this sort of list when you run into a problem you've never seen before? I tend to reason from first principles, or apply Cartesian doubt: trust nothing about the system until you have verified it. Based on the nature of the failure you are seeing and the knowledge you have of the system, think of everything that could have gone wrong. Start with a likely cause and follow it out until you are satisfied that particular subsystem is working. Move on to the next and repeat. If the problem is in code that you (or someone you know) wrote, think about the things that are typical problems in your code. Do you often make fencepost errors? Are you working in a language that is prone to API mistakes like calling a method with the arguments in the wrong order? Are you working in a threaded environment and experiencing a concurrency issue?
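For anyone who hasn't met the term, here is a small sketch of the fencepost error the name comes from. The function is hypothetical and assumes the fence length is an exact multiple of the spacing.

```python
def fence_posts_needed(length_m: int, spacing_m: int) -> int:
    """Posts for a straight fence with a post at each end.

    Assumes length_m is an exact multiple of spacing_m.
    """
    sections = length_m // spacing_m
    # The classic bug is to return `sections`: a 30 m fence with posts every
    # 10 m has 3 sections but needs 4 posts.
    return sections + 1


print(fence_posts_needed(30, 10))  # 4
```

If this is the kind of mistake you make often, loop bounds and slice ends belong near the top of your checklist.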

I wish I had something better to say at the end of all this, but maybe some part of it is non-obvious: think about what could be broken, verify that it's not broken, repeat. Sounds pretty simple when I put it like that, but there's some magic in there that I can't quite elucidate.

A lot of the blogs that I read are about programming and project management. They are predominantly written by people who have a lot of online credibility as "experts" in their field. It dawned on me recently that although I don't have the interwebz cred that they do, I actually have as much real world programming and project management experience as most of these "experts," or more.

The funny thing about getting old is you don't realize it.
— Dolly Parton (and I'm sure a lot of other people but that's the google hit I got for the phrase)

Anyway, if being 30+ years old and having worked as a professional programmer for 10+ years qualifies some other guys to pontificate about the business of programming and hit the front page of digg for doing it, I might as well do some of it too.

"I think ..." is a phrase that strikes fear into my heart when uttered during discussions about debugging, troubleshooting or project status. "I think" means that the speaker is not confident enough to say "I know." "I know" means that it is a dead certainty that the following statement is fact. It should mean that the statement has been verified before being shared with the team. "I think" means that something is being vaguely recalled or assumed based on history, pride or prejudice.

I heard the dreaded "I think" phrase three times at work today. Two out of the three times the speaker was soon proven wrong by actually checking. That in and of itself is not a big deal. People are wrong all the time; in the software world it's to be expected. The dangerous thing about "I think" is that it costs valuable time. A potential failure mode has been identified and someone has asked whether it could be the true cause of the current problem. Because someone thinks it is not the problem, it is discarded as a root cause and other avenues are searched. Unless someone else argues forcefully for the cause that the group now believes has been disproved, it will not be revisited until many other options have been exhausted.

In my experience, it usually takes only a few minutes to know something instead of just thinking it. These few minutes spent up front can be critical when an emergent problem is being dealt with. Troubleshooting (at least good troubleshooting) is the process of walking an n-ary tree depth first. The early decisions prune large portions of the search space, and if a subtree is discarded falsely, it will take great effort to resurrect it.
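Here is a minimal sketch of that depth-first walk, using the network checklist as the tree. The `Hypothesis` class, the node names and the hard-coded test results are all invented for the example; in real life each `is_broken` would be an actual test rather than a lambda.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    name: str
    is_broken: Callable[[], bool]  # an actual test, not a recollection
    children: List["Hypothesis"] = field(default_factory=list)


def find_root_cause(node: Hypothesis, visited: List[str]) -> Optional[str]:
    """Depth-first search that descends only into subsystems verified as broken."""
    visited.append(node.name)
    if not node.is_broken():
        return None  # verified healthy, so the whole subtree is pruned
    for child in node.children:
        cause = find_root_cause(child, visited)
        if cause is not None:
            return cause
    return node.name  # broken, with no more specific broken child


# Hypothetical outage where the upstream firewall is the real cause.
tree = Hypothesis("no connectivity", lambda: True, [
    Hypothesis("interface", lambda: False, [
        Hypothesis("physical link", lambda: False),
        Hypothesis("interface active", lambda: False),
    ]),
    Hypothesis("firewall", lambda: True, [
        Hypothesis("local firewall", lambda: False),
        Hypothesis("upstream firewall", lambda: True),
    ]),
    Hypothesis("routing", lambda: False),
])

visited: List[str] = []
print(find_root_cause(tree, visited))  # upstream firewall
print(visited)
# ['no connectivity', 'interface', 'firewall', 'local firewall', 'upstream firewall']
```

An "I think the firewall is fine" is the same as making the "firewall" test return False without running it: the search prunes the subtree that holds the real cause and wanders off into routing instead.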

I'm going to write another post at some point about how I approach building and ordering the decision tree, because order is crucial for efficient operation, but the point I'd like to get across today is that each hypothesis needs to be tested in real time. Know that it is true or false and you will save a lot of time and confusion in the long run. When the issue at hand is a loss of several thousand dollars of corporate income per minute, it can make the difference between being fired and getting the biggest bonus of your life. If you approach every problem as though it were that important, you will become a better programmer/sys admin/manager/mechanic/whatever.

