One of the most critical and least popular activities in developing and maintaining software is debugging – fixing code that has been written incorrectly. Although ubiquitous, debugging is a poorly understood activity. Different people use a wide array of techniques, and it gets little attention in programming courses. So many books, papers, and articles have been written on design, team methodology, and testing that you can hardly avoid them when you work in software, but you have to go looking for information on debugging. Though perhaps few would say so if you asked them in so many words, people seem to tacitly assume that debugging is an activity that just is and can’t be improved, or that it’s a matter of innate talent rather than a learnable skill. And no other development activity produces frustration so frequently or consistently (difficult debugging sessions often elicit loud swearing from even otherwise soft-spoken developers).
Regardless of what may be involved in the skill, I’ve found over the years that debugging techniques have taught me a lot about thinking effectively in other areas of life. In this series, I hope to share a little bit of this wisdom with you even if you don’t know anything about software development. Maybe that’s a tall order, but I’m going to do my best.
This week, we’ll start by discussing bugs themselves and the process of debugging so we’ll have some vocabulary and a foundation to talk about techniques.
Where do bugs come from?
At its most basic, a software bug is an error in a program that results in incorrect or unwanted behavior. To answer the question in the section header, people wrote the programs, so people made the bugs. But people might have made them for a wide variety of reasons. Bugs can be classified as one of several broad types, which have different causes. Other classifications are possible besides the one I present here, but I find this one especially helpful in understanding causes.
Syntax errors are the most basic type of bug and the easiest to resolve. They’re caused by mistyping the code or not knowing the programming language well enough. Here’s a typical syntax error, in the Python language (I pick Python because it’s English-like and pretty readable even if you don’t know anything about programming):
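A small program along those lines might read like this (the listing is my own sketch; the function names are invented for illustration):

```python
def add_one(number):
    return number + 1

def multiply_by_three(number):
    return number * 3

print(add_one(multiply_by_three(5))
```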
The intended purpose of this code is to multiply 5 by 3, then add one to it, and print out the resulting number. (Like in algebra, we work from the innermost parentheses out.) In the first two paragraphs, we told the computer what “add one” and “multiply by three” meant, and then in the last one we used those definitions. In theory, we ought to get 16, since 5 * 3 + 1 = 16. Here’s what actually happens when we run it:
```
  File "test.py", line 8

    ^
SyntaxError: unexpected EOF while parsing
```
That error’s pretty cryptic, but can you find the mistake on the last line?
I opened three sets of parentheses on that last line, but closed only two of them.
Syntax errors are infuriating when you first start programming – most people aren’t used to seeing the ways symbols line up and the kind of typographical errors that humans look right past but computers get hopelessly confused by, so they’re easy to create and difficult to spot once made. However, after you’re familiar with a programming language, syntax errors become trivial annoyances that can usually be resolved in seconds – if the error is unusually well-hidden, it might take five minutes. Further, the program usually won’t run at all if it contains a syntax error, so these bugs are obvious and therefore rarely get any further than the programmer’s desk before being identified and extinguished.
Logic errors occur when the code is mechanically correct but the programmer didn’t correctly translate what tasks needed to be done into code, so the code carries out the wrong steps and consequently gets the wrong answer. For example, let’s make a list of two names and then print the first name on the list:
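Sketched out, it might look like this (again my reconstruction; the names are illustrative):

```python
# Make a list of two names, then try to print the first one.
people = ["Alice", "Bob"]
print("The first person is", people[1])
```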
When we run that, we get:
```
The first person is Bob
```
Bob is emphatically not the first person on that list, so what gives?
As we intended, we created a list of people – two of them, to be exact, Alice and Bob – and then used the index operator, people[1], to get one item from the list. But in Python, the first item in a list is referred to by index 0, not 1 (this actually makes great sense for numerous reasons, but it takes a few months of programming to fully understand why!). Therefore, index 1 actually referred to the second person, Bob.
Here’s a classic logic error that expands on the above:
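In the same style as before (my sketch of the listing the surrounding text describes):

```python
people = ["Alice", "Bob"]
counter = 0
while counter <= len(people):
    print(people[counter])
    counter = counter + 1
```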
while introduces a loop, where the same code is run multiple times under slightly different conditions. We start with counter equal to zero because, as we just showed above, the first item in a list in Python is item 0. Each time through the loop, we print the item with the index number stored in counter, then we add one to counter so the next item from the list will print next time. When the condition after the while is no longer true (that is, when counter reaches a number greater than the number of people in the list, meaning we’ve seen them all), we stop looping.
That’s the theory, anyway; we expect to see Alice and Bob printed out. Let’s try actually running it:
```
Alice
Bob
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
IndexError: list index out of range
```
That’s not right! What happened?
Well, we tried to run the loop three times, but there were only two items in the list. Why did we try to run the loop three times? Look a bit closer. We checked if counter was less than or equal to the number of people. That makes sense, because we want to print the last person’s name as well as the ones before it – except we started with index zero, so our reckoning is off by one. We should have checked if counter was strictly less than the length, or else added one to the value of counter on each comparison. This infuriatingly common mistake is called an off-by-one error, and anyone who’s ever tried programming will recognize it at once.
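For completeness, here’s the loop with the first of those fixes applied – a strict less-than – so it stops after the last valid index:

```python
people = ["Alice", "Bob"]
counter = 0
while counter < len(people):  # strictly less than: runs for indexes 0 and 1 only
    print(people[counter])
    counter = counter + 1
```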
Logic errors can get almost arbitrarily complicated. They might involve implementing an established algorithm incorrectly in a non-obvious way that hides for months. They might cause program crashes only in very obscure situations which make the problem hard to identify because they occur so rarely. They might even cause security vulnerabilities.
Logic errors are often the worst type of error to debug. Some are obvious enough, like the examples above, and are found easily during testing. But logic errors often manifest in unpredictable ways and end up in the places you would least expect them, and the worst of them show up only under quite specific circumstances. That allows them to hide for months or even years and makes them hard to find and easy to write off as flukes or user error even once they’ve been reported.
Logic errors are a specific form of semantic error. Semantic errors in general are those in which the code is syntactically correct but doesn’t do what the programmer intended. Here’s a semantic error that isn’t a logic error:
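Something like this (my sketch; the function name is the whole point):

```python
def divide_by_three(number):
    return 3 / number  # divides 3 by the argument, not the argument by 3

print(divide_by_three(12))
```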
Reading the names the programmer chose to use in this code, it’s clear that she intended to divide 12 by 3, printing 4; but the code actually divides 3 by 12, printing 0.25. divide_by_three doesn’t do what it says.
General semantic errors make you slap yourself when you spot the problem, but can be nearly as difficult to find as logic errors. In particular, it’s easy to look right at a semantic error and not see it, because the names and comments written in the program typically don’t match what’s actually happening, and we tend to use those over the actual instructions to figure out what the code is doing.
A remarkable number of bugs actually occur when a programmer writes code that is syntactically correct and even does exactly what he intended but solves the wrong problem, usually because he isn’t fully informed about what the problem entails. These are requirements errors: the code faithfully implements a wrong or misunderstood requirement.
A famous real-life example is the Mars Climate Orbiter, which was destroyed in 1999 during its orbital insertion maneuver because, due to a miscommunication between NASA and Lockheed Martin, it received force data in imperial units when it was expecting metric units. Lockheed’s software did exactly what the programmers intended it to do – it didn’t crash, the math was all right, and it returned the answer they had designed it to return, the required thrust in pound-force seconds – but the question they were trying to answer was the wrong question.
Requirements errors tend to be relatively easy to identify and locate when compared to logic errors, though they can sometimes be challenging. However, they’re often among the most obnoxious because fixing them can require ripping up and rewriting significant portions of code. In the worst case, an entire system may have been designed around the incorrect assumption, requiring hours or days of rework. The problem is compounded because the programmer frequently won’t notice them when testing her own code; after all, she’ll be testing it against her own (wrong) idea of the requirement. In the worst case, of course, you send an expensive spaceship all the way to Mars before you notice.
What is debugging?
Obviously, debugging is the process of getting rid of bugs, but that isn’t too enlightening. This Reddit post put it better than I could:
Debugging is the art of finding out what you really told your computer to do instead of what you thought you told it to do.
Debugging is largely about finding bugs, not fixing them. Fixing bugs, except sometimes requirements bugs, is more often than not easy and straightforward. A bug might be as simple as a missing period, which can be fixed in two seconds by tapping the period key on the keyboard, but it might take two hours – or even two days, in a sufficiently complicated system – to find that missing period. Finding that period can require all the programmer’s skills.
The significance of debugging
I would argue that debugging is a more fundamental activity to software development than actually writing the code in the first place. That’s for several reasons:
- Nobody can escape debugging. Even the best programmers write code filled with bugs that have to be identified and fixed. Meanwhile, even if you’ve never written a program in your life, you could go write some Python code that doesn’t work right now. It probably won’t even be syntactically valid, but you could do it. Making it work is the hard part.
- Software that hasn’t been debugged is useless. Even bad software that is free of bugs can be useful – at least it does what it’s supposed to do.
- Debugging is hard. Brian Kernighan, writing in a seminal book on the enormously influential C language, quipped: “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” Writing the code is easy – you just explain your understanding of the problem. Debugging is harder. Your own understanding won’t cut it; if your code isn’t logically perfect, you come face to face with that fact.
- Debugging is fundamentally creative, at least when it gets beyond a certain level of difficulty. You can always choose between multiple ways of attacking a problem, and with reasonably complicated bugs, isolating the bug usually comes in a flash of insight. Some programming is creative in the same way, mostly the part that has to do with math and algorithms, but a lot of it is putting pieces together in a straightforward way using established patterns.
The process of debugging
How do you track down a bug once you know it exists? First you have to reproduce it. If you can’t even make the bug happen, it’s next to impossible to figure out why it’s happening. So you first need to identify under what conditions it occurs. That could be as simple as, “when I click the OK button here, the program crashes every time,” or it could require certain data to be entered before clicking OK, or it could involve pressing a series of keys in the exact right order at the right time at such a rapid speed that only a skilled operator and not a programmer could hope to reproduce it (see Therac-25).
Occasionally just identifying the exact circumstances is enough to set off the lightbulb, particularly if you’re familiar with the system and just wrote the offending code a few minutes ago. But most of the time you have to proceed by gathering information. Without some kind of information, you’ll just be taking random shots in the dark. Depending on the bug, the system you’re working on, and your software development practices, different options will be available. Here are some common sources of information:
- Recent changes: If you just added 10 new lines of code and now something doesn’t work when it did before, it’s a fair bet that the bug is somewhere in those 10 lines of code. In that case, you might just be able to go back and read over those 10 lines and spot the error.
- Older changes: If the bug has been present for longer, you can go back and look at source control history (i.e., an annotated log of all the changes people have made to the program). If you’re lucky and your team places a priority on keeping a clean, useful history, you might find annotations pointing to the source of the problem. Even if you’re not lucky, you can at least identify which change introduced the error (in the worst case, by running each version of the code and seeing whether the bug occurs or not) and have a much smaller area to search.
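As a concrete sketch, the git version control system automates that worst-case search with its bisect command (this assumes your project uses git and you have a way to test each version; the tag name below is invented):

```
git bisect start
git bisect bad              # the version you have now exhibits the bug
git bisect good v1.2.0      # this older release was known to work
# git checks out a revision halfway between; you test it, then report
# "git bisect good" or "git bisect bad" until the culprit commit is found
git bisect reset
```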
- Logging: Many programs are designed to write information about what they’re doing to a file, or to send it to the developer over the Internet, which may help in identifying what actions the program was taking before it went astray. In my examples of bugs earlier, I showed tracebacks provided by Python itself, explaining which line of code caused the immediate error. That won’t necessarily be the original source of the bug (since an incorrect result might not be used and cause the program to fail until later), but it often is, and in any case it’s a great start. If no useful logs are available but you have some idea of where to begin, you can add some logging to a section of code that seems suspicious.
- Interactive debugging: With appropriate tools available for common programming environments, instead of getting an explanation of what happened after the fact, you can pause the program at important points while it’s running, inspect what it’s doing and what data it’s working with, run extra code in the middle to experiment and see if it fixes the problem, and sometimes even reverse the program’s execution to try it again.
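For example, a session with Python’s built-in pdb debugger over the buggy loop from earlier might look roughly like this (output abridged; the path and file name are illustrative):

```
$ python -m pdb loop.py
> /home/user/loop.py(1)<module>()
-> people = ["Alice", "Bob"]
(Pdb) next
> /home/user/loop.py(2)<module>()
-> counter = 0
(Pdb) p people
['Alice', 'Bob']
(Pdb) break 4
Breakpoint 1 at /home/user/loop.py:4
(Pdb) continue
> /home/user/loop.py(4)<module>()
-> print(people[counter])
(Pdb) p counter
0
```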
- Network traffic analysis: For programs that send information through the Internet, you can use special software that intercepts the requests the programs are sending each other and lets you look at them.
- Research: While you may be the first person to encounter this exact problem, other people have probably encountered similar ones, and a quick Google search can turn up valuable tip-offs about where to look. Poring through the official documentation for systems you’re working with may also help you identify misunderstandings that led to semantic or requirements errors.
Once the source has been narrowed down somewhat, you have to look at the code that seems to be the source of the problem. I won’t talk much further about the techniques here, because many of these correspond to cognitive tools that I’ll be discussing in posts in this series. Ultimately, you have to work out where your thinking was in error when you wrote the code (or, often enough, where someone else’s thinking was in error, if you weren’t the original author, which is even harder). You might go back and forth between gathering information and studying the code, identifying theories, testing them, and proceeding to the next when they’re wrong. And eventually, the bug shows up, and you can fix it.
In the next posts in this series, I’ll veer away from software itself, touching only enough of it to ground my arguments and explanations, and go into how some of the techniques programmers use for debugging can be helpful in everyday life.