Cartesian Skeptic: 2008/06

I tend to call any troubleshooting process "debugging." A week or so ago I debugged our lawnmower when it wouldn't start. Today I spent a lot of time (unsuccessfully) debugging a mysterious Apache child crash related to a PHP application. Tomorrow I will undoubtedly debug some software at work and some other things in "real life."

I would guess that roughly 40% of my life as a professional programmer and system administrator has been spent debugging. By that I don't mean to imply that I only write broken code (although I probably do); what I mean is that, just as most writing is rewriting, most programming is debugging. A first attempt at solving a problem rarely works correctly. If it were that easy to solve, it must not have been much of a problem in the first place, right?

For me, debugging software/hardware/whatever is mostly about thinking of the most common things that could be causing the malfunction and then systematically proving that each of them is or is not the cause of the current problem. When I was a kid and my dad taught me how to keep our crusty old lawnmower running, he gave me the checklist "fuel, fire, air." This magic formula has pointed to the cause of 90% of the problems I have ever had with engines.

Does it have gas in the tank?
Is the gas getting to the cylinders?
Is the battery/magneto/alternator working?
Is the charge making it past the points/distributor?
Is the spark plug fouled?
Is the air filter clogged?
Is the choke stuck closed?

The order here is important too. On each branch of the decision tree we start at one end of the subsystem and walk towards the other end, checking the key weaknesses in between. The branches are also arranged in order of most frequent failure. All this ordering hopefully gets you to the root problem earlier rather than later. You need to tailor the tree based on prior experience. When working with the beater Datsun I drove in college, fire came before fuel because it had constant alternator and distributor problems.
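Here is a rough sketch of that ordered walk in Python, using the "fuel, fire, air" tree. The function names, the observation keys, and the idea of representing each check as a function that returns True when the part is healthy are all my own assumptions for illustration; a real mower obviously needs a human doing the probing.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A check pairs a question with a probe that returns True when that part is healthy.
Check = Tuple[str, Callable[[], bool]]
Branch = Tuple[str, List[Check]]


def first_failure(branches: List[Branch]) -> Optional[Tuple[str, str]]:
    """Walk the branches in order and return (branch, question) for the first failed check."""
    for name, checks in branches:
        for question, is_ok in checks:
            if not is_ok():
                return name, question
    return None  # every check passed; the checklist does not cover this problem


def mower_tree(obs: Dict[str, bool]) -> List[Branch]:
    # obs maps a hypothetical observation name to True when that part is healthy.
    return [
        ("fuel", [
            ("Does it have gas in the tank?", lambda: obs["gas_in_tank"]),
            ("Is the gas getting to the cylinders?", lambda: obs["gas_reaches_cylinders"]),
        ]),
        ("fire", [
            ("Is the battery/magneto/alternator working?", lambda: obs["ignition_source_ok"]),
            ("Is the charge making it past the points/distributor?", lambda: obs["charge_delivered"]),
            ("Is the spark plug fouled?", lambda: obs["plug_clean"]),
        ]),
        ("air", [
            ("Is the air filter clogged?", lambda: obs["filter_clear"]),
            ("Is the choke stuck closed?", lambda: obs["choke_free"]),
        ]),
    ]


# Made-up observations: everything is fine except a fouled plug.
observations = {
    "gas_in_tank": True,
    "gas_reaches_cylinders": True,
    "ignition_source_ok": True,
    "charge_delivered": True,
    "plug_clean": False,
    "filter_clear": True,
    "choke_free": True,
}

print(first_failure(mower_tree(observations)))
```

For the Datsun, the only change would be moving the "fire" branch ahead of "fuel" in the list.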

Software debugging works just like the mechanical troubleshooting process for a malfunctioning engine. The only difference is the decision tree. If I'm working on a network connectivity problem, the quick checklist usually goes something like "interface, firewall, routing."

Is the physical link in place?
Is the interface active?
Is there a local firewall block?
Is there an upstream firewall block?
Is there an outbound route?
Is there an outbound route from the next hop?

Just like the lawnmower-to-car differences, this list gets modified based on what I'm working on. This order worked best for border connectivity issues on the frame relay networks I used to work with. The network I'm on today calls for checking the firewall first because 50% of the time that's where the problem is.
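One way to picture that tailoring: keep a tally of where past problems turned out to be and sort the checklist by it. This is only a sketch; the function name and the incident history are invented for the example, not data from any real network.

```python
from collections import Counter
from typing import List


def reorder_by_history(checklist: List[str], past_causes: List[str]) -> List[str]:
    """Put the branches that failed most often in the past at the front.

    sorted() is stable, so branches with equal counts keep their original order.
    """
    counts = Counter(past_causes)
    return sorted(checklist, key=lambda branch: -counts[branch])


default_order = ["interface", "firewall", "routing"]

# Hypothetical root causes of the last six incidents; half were the firewall.
history = ["firewall", "routing", "firewall", "interface", "firewall", "routing"]

print(reorder_by_history(default_order, history))
```

With no history at all, the list comes back in its default order, which is the right behavior for a network you have never worked on before.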

So how do you construct this sort of list when you run into a problem you've never seen before? I tend to reason from first principles, or use the technique of Cartesian doubt. Based on the nature of the failure you are seeing and the knowledge you have of the system, think of everything that could have gone wrong. Start with a likely cause and follow it out until you are satisfied that particular subsystem is working. Move on to the next and repeat. If the problem is in code that you (or someone you know) wrote, think about the things that are typical problems in your code. Do you often make fencepost errors? Are you working in a language that is prone to API mistakes like calling a method with the arguments in the wrong order? Are you working in a threaded environment and experiencing a concurrency issue?
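In case the term is unfamiliar, a fencepost error is the classic off-by-one: counting the gaps when you meant to count the posts. A tiny made-up example (the function names are mine, and it assumes the fence length divides evenly by the spacing):

```python
def sections(length_m: int, spacing_m: int) -> int:
    """Number of gaps between posts along the fence."""
    return length_m // spacing_m


def posts_buggy(length_m: int, spacing_m: int) -> int:
    # The fencepost error: this counts gaps and forgets the post at the far end.
    return sections(length_m, spacing_m)


def posts(length_m: int, spacing_m: int) -> int:
    # A straight fence needs one more post than it has gaps.
    return sections(length_m, spacing_m) + 1


print(posts_buggy(30, 10), posts(30, 10))
```

If this is a mistake you know you make, a check for it belongs near the top of your own list.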

I wish I had something better to say at the end of all this, but maybe some part of it is non-obvious: think about what could be broken, verify that it's not broken, repeat. Sounds pretty simple when I put it like that, but there's some magic in there that I can't quite elucidate.

Debugging for Fun and Profit
