
The science of top-down debugging

I’ve found that the biggest difference between effective and ineffective debugging is the process used to root out a problem. Inexperienced people too often bang their heads against a wall without knowing how to take the next step toward solving a problem. Their problem is not following an age-old technique that is responsible for lifting mankind out of the dark ages: the scientific method!

In any situation, I (and literally centuries of human progress) claim the following steps will eventually lead to an answer, or at least a reason why you can’t figure out the answer. The best part? It’s recursive! Just ask physicists.

Steps for top-down debugging

The trick to debugging a problem is to use the power of deduction. Start with research, create a hypothesis, find a way to test that hypothesis, execute the “experiment”, and then refine the hypothesis. For debugging, I am looking for the action that disproves my hypothesis, as I start with the assumption that “if this system were working, it would {insert action}”.

Find an interface

An interface takes many forms. It may be the user interface, like a webpage; it may be an API; it may be a single line of code. It is something with a definable surface area on or near the thing you’re trying to debug. Hopefully it’s a function, because you might be able to wrap a unit test around it and ensure that nobody ever has to check there again.

Common interfaces include:

  • user interfaces
  • function calls – the best kind
  • API endpoints
  • system calls (try “strace -ff -t -s 1000 -p {process id}” to experience the matrix)
  • poorly structured blocks of code
  • state variables?
  • Writing out to a database
  • Writing out to files?
  • Just wires. lots of wires.

Define expected behavior

For this part, you may need to ask around, use intuition, or check documentation, but you need to understand what “working” looks like. If you don’t know what SHOULD happen, you don’t stand a chance of fixing the problem.

Find a way to verify expected behavior at that interface

This can be anything from a unit test (preferred, and it even becomes easier the deeper you go) to manually looking at the output. This is sometimes very tricky and will lead you to learn many new tools. If you don’t know a unit testing framework, code katas are a great way to learn by example. Search for “prime factors kata in {language}” to find one in your language. Learn to do this by yourself.
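As a rough illustration of wrapping a test around a function, here is a minimal sketch of the prime factors kata in Java. The class name `PrimeFactorsKata`, the method `primeFactors`, and the helper `expect` are names I made up for this example. To keep it self-contained, it uses a hand-rolled check instead of a real unit testing framework; in practice you would write the same expectations as framework tests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical kata solution: the "interface" under test is primeFactors(int).
class PrimeFactorsKata {

    // Returns the prime factors of n in ascending order (empty for n < 2).
    static List<Integer> primeFactors(int n) {
        List<Integer> factors = new ArrayList<>();
        for (int candidate = 2; n > 1; candidate++) {
            // Divide out each candidate as many times as it fits.
            while (n % candidate == 0) {
                factors.add(candidate);
                n /= candidate;
            }
        }
        return factors;
    }

    // Stand-in for a framework assertion: fail loudly on a mismatch.
    static void expect(List<Integer> expected, int input) {
        List<Integer> actual = primeFactors(input);
        if (!expected.equals(actual)) {
            throw new AssertionError("primeFactors(" + input + "): expected "
                    + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        // Each line states what "working" looks like at this interface.
        expect(List.of(), 1);
        expect(List.of(2), 2);
        expect(List.of(2, 2), 4);
        expect(List.of(2, 3), 6);
        expect(List.of(2, 2, 2, 3, 3), 72);
        System.out.println("all expectations held");
    }
}
```

Each `expect` line is a tiny hypothesis about the function. When one fails, the message tells you which input disproved it.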

In order to find your interface to test, a debugger is often one of the most precious tools you can use. All major browsers have one (hit F12), and every “real language” has one as well. Learn one. It’s going to help you. They are often built into IDEs. If you can’t use a debugger (maybe the issue really only happens in production!), you may be relegated to using loggers, or worse, printf. Find something that can get the job done.
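When a debugger is not an option, logging what goes in and what comes out of the suspect interface is often enough to compare against the expected behavior. The sketch below uses the JDK’s built-in `java.util.logging`; the class `LoggedInterface` and the function `parsePort` are hypothetical stand-ins for whatever you are investigating.

```java
import java.util.logging.Logger;

class LoggedInterface {
    private static final Logger LOG = Logger.getLogger(LoggedInterface.class.getName());

    // Hypothetical function under suspicion: log the input and the output
    // so both sides of the interface are visible in the log.
    static int parsePort(String raw) {
        // Brackets make stray whitespace in the input easy to spot.
        LOG.info(() -> "parsePort input=[" + raw + "]");
        int port = Integer.parseInt(raw.trim());
        LOG.info(() -> "parsePort output=" + port);
        return port;
    }

    public static void main(String[] args) {
        parsePort(" 8080 ");
    }
}
```

The supplier form of `info` only builds the message when the log level is enabled, which keeps the extra logging cheap enough to leave in place while you investigate.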

Here is a set of tool classes that you should have some familiarity with:

  • A unit testing framework
  • A mocking framework
  • A logging framework
  • A step debugger
  • netcat / telnet / some sort of TCP-based communication tool
  • A profiling tool
  • curl / wget / some HTTP-based network tool (for web)
  • the built-in browser debugger (for web)

If possible, make that work an “investment” that can continue to improve the codebase; make it an automated test.
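As one example of turning a manual check into something repeatable, the question a quick netcat or telnet session answers (“can I open a TCP connection to this host and port?”) can be captured in a few lines. This is only a sketch; `TcpProbe` and `canConnect` are hypothetical names, and the demo listens on a local ephemeral port so it does not depend on any real service.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class TcpProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, unresolvable host: all count as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Listen on an ephemeral local port so the example needs no outside network.
        try (ServerSocket server = new ServerSocket(0)) {
            int port = server.getLocalPort();
            System.out.println("listening port reachable: "
                    + canConnect("127.0.0.1", port, 1000));
        }
    }
}
```

A check like this can live in an automated test suite, so the next person does not have to repeat the manual probe.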

If there’s a difference, dig into that interface.

At the level you are testing, you will find something that breaks your hypothesis. This is the chunk of code to investigate next. Open it up, and examine it. Generate hypotheses, and repeat.
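The loop described above can be sketched in code. Assume, purely for illustration, that the system is a pipeline of named stages and that you know what each stage should output if the system were working. The class `FirstBrokenStage`, its `Stage` type, and the sample pipeline (including its deliberate bug) are all hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.UnaryOperator;

class FirstBrokenStage {

    // One interface: a named step plus the output expected from it
    // if the system were working.
    static final class Stage {
        final String name;
        final UnaryOperator<String> step;
        final String expectedOutput;

        Stage(String name, UnaryOperator<String> step, String expectedOutput) {
            this.name = name;
            this.step = step;
            this.expectedOutput = expectedOutput;
        }
    }

    // Runs the stages in order and returns the name of the first stage whose
    // output disproves the hypothesis, or empty if every stage behaved.
    static Optional<String> firstBroken(String input, List<Stage> stages) {
        String value = input;
        for (Stage stage : stages) {
            value = stage.step.apply(value);
            if (!stage.expectedOutput.equals(value)) {
                return Optional.of(stage.name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Stage> pipeline = List.of(
                new Stage("trim", String::trim, "Hello World"),
                // Deliberate bug for the demo: upper-cases instead of lower-casing.
                new Stage("lowercase", s -> s.toUpperCase(), "hello world"),
                new Stage("slugify", s -> s.replace(' ', '-'), "hello-world"));

        System.out.println("investigate next: "
                + firstBroken("  Hello World  ", pipeline).orElse("nothing"));
    }
}
```

Whichever stage is reported becomes the new “system”: split it into its own interfaces and run the same comparison one level down.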

If everything looked like it worked, think outside the box

Something obviously didn’t work, otherwise you would not be investigating this!

Think about the resources being consumed. Are any of these being constrained?

Disk is finite. So are memory, CPU, and even the number of threads you can use. Are any of these limits being reached?

  • Compare disk usage to mount availability with “df -h”
  • Check disk utilization, file I/O, and long-term CPU usage with “sar”
  • Compare CPU usage with the number of CPUs with “w” and “cat /proc/cpuinfo”. Sometimes you’ll be surprised to find out that a CPU is not running as fast as you would think.
  • Check memory usage with “free -m” (also, please read linuxatemyram.com)
  • Check file handle counts vs available file handles using “lsof | wc -l” and “ulimit -a”
  • Check thread counts with “ps uxH | wc -l” and “ulimit -a” (again)

If any one of these things is constrained, you are in real trouble and, if nothing else, you should probably fix that.
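If the process you are debugging is a JVM, some of the same questions can be asked from inside it. The sketch below is not a replacement for the commands above: it only reports what the JVM itself can see (its own heap and threads, and the free space on one path). `ResourceSnapshot` and its method names are hypothetical.

```java
import java.io.File;
import java.lang.management.ManagementFactory;

class ResourceSnapshot {

    // Bytes this JVM could still write to the partition holding the given path.
    static long usableDiskBytes(String path) {
        return new File(path).getUsableSpace();
    }

    // Number of processors the JVM believes it may use.
    static int cpuCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Heap currently in use by this JVM.
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    // Upper bound the heap is allowed to grow to.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // Live threads in this JVM, daemon and non-daemon.
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("usable disk bytes (.): " + usableDiskBytes("."));
        System.out.println("cpus:                  " + cpuCount());
        System.out.println("heap used / max:       " + usedHeapBytes() + " / " + maxHeapBytes());
        System.out.println("live threads:          " + liveThreadCount());
    }
}
```

Logging a snapshot like this right before the failing operation is a cheap way to rule resource limits in or out when you cannot get a shell on the machine.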

Is this a timing-based issue? What happens when I add a “gate” around the code in question?

Timing-related issues are notoriously hard to solve, and they are due to a wrong assumption about the order in which things happen. A common symptom of a timing-related issue is that adding log messages makes the problem go away. A great way to debug a threading issue is to see if the problem still happens when you take the threading out of the picture.

A technique I employ when I run into these issues in Java is adding the synchronized keyword, which turns the whole method into a gate:

synchronized void questionablyThreadsafeFunction() {
   // begin code that is blatantly not threadsafe
   this.thingsThatArentThreadSafe++; // totally atomic ;P (really a read, an add, and a write)
}

This sort of investigation is usually done after I have a test that reliably re-creates the issue, for example by calling that function on the same instance of a class thousands of times on hundreds of threads.
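A minimal sketch of such a reproducing test follows. The names `RaceRepro`, `Counter`, and `hammer` are hypothetical, and the thread and call counts are scaled down to keep the run short. The gated run should always reach the expected total. The ungated run may fall short because of lost updates, though on any given run it can also happen to match.

```java
class RaceRepro {

    // Hypothetical class under test, mirroring the counter in the snippet above.
    static class Counter {
        private int count = 0;

        void ungatedIncrement() { count++; }            // read, add, write: not atomic
        synchronized void gatedIncrement() { count++; } // one thread at a time
        synchronized int get() { return count; }
    }

    // Calls the chosen increment callsPerThread times on each of threadCount
    // threads, all sharing one Counter, and returns the final count.
    static int hammer(boolean gated, int threadCount, int callsPerThread) {
        Counter shared = new Counter();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                for (int call = 0; call < callsPerThread; call++) {
                    if (gated) {
                        shared.gatedIncrement();
                    } else {
                        shared.ungatedIncrement();
                    }
                }
            });
        }
        for (Thread worker : workers) {
            worker.start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return shared.get();
    }

    public static void main(String[] args) {
        int threads = 16;
        int calls = 10_000;
        System.out.println("expected: " + (threads * calls));
        System.out.println("gated:    " + hammer(true, threads, calls));
        System.out.println("ungated:  " + hammer(false, threads, calls));
    }
}
```

If the shortfall disappears once the gate is in place, that is strong evidence the bug is a timing issue inside the gated code.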

Additional Tweaks

Use the path of least surprise

Just because you “know” your code is flawless does not mean the first place you should check is for bugs in your language’s implementation of “string”. That doesn’t mean the bug isn’t there, but you should focus on the areas that are most likely to be the source of your error. Protip: It’s usually your fault.

Reducing Cycle Time

I am also looking for the test that has the highest “effectiveness / cycle time” quotient. I try to find ways to test on my desktop, then in dev, then in test, and then (if all else fails) in production. A lower cycle time is the key to productive debugging. In my experience, a totally out-there bug will take you to a search depth of about six to find a root cause. If your cycle time is greater than an hour, it is likely worth spending an hour learning a new skill that reduces that cycle time; the investment will pay for itself on this bug alone.
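The arithmetic behind that claim is easy to check. The sketch below assumes one test cycle per level of search depth, which is optimistic, and the 10-minute improved cycle time is a number I picked for illustration. `CycleTimeBudget` and `totalMinutes` are hypothetical names.

```java
class CycleTimeBudget {

    // Total minutes = one-off investment + one test cycle per level of search depth.
    static int totalMinutes(int searchDepth, int cycleMinutes, int investmentMinutes) {
        return investmentMinutes + searchDepth * cycleMinutes;
    }

    public static void main(String[] args) {
        // Depth of six with a one-hour cycle and no investment.
        int slow = totalMinutes(6, 60, 0);
        // Same depth after spending an hour to get the cycle down to 10 minutes (assumed).
        int fast = totalMinutes(6, 10, 60);
        System.out.println("slow path minutes: " + slow); // 360
        System.out.println("fast path minutes: " + fast); // 120
    }
}
```

Under these assumptions the hour of learning is recovered three times over on a single bug, and every later bug gets the shorter cycle for free.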

The complexity of this technique

From a computational complexity standpoint, this technique has a worst-case search cost of about $K \cdot \log(N)$ checks, with K being the number of “chunks” you divide your problem into at each step and N being the number of places the bug could be hiding. Even terrible code can at least be split into parts with very large log statements covering the current state of things. It’s logarithmic.
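To put rough numbers on this, here is a small sketch. It assumes my own reading of the formula: N is the number of places the bug could be, each step splits the remaining suspects into K roughly equal chunks, and in the worst case you check every chunk before descending. `SearchCost`, `levels`, and `worstCaseChecks` are hypothetical names.

```java
class SearchCost {

    // Number of times `places` must be split into `chunks` parts
    // before a single place remains.
    static int levels(long places, int chunks) {
        if (chunks < 2) {
            throw new IllegalArgumentException("need at least 2 chunks per split");
        }
        int depth = 0;
        long remaining = places;
        while (remaining > 1) {
            remaining = (remaining + chunks - 1) / chunks; // ceiling division
            depth++;
        }
        return depth;
    }

    // Worst case: every chunk is checked at every level.
    static long worstCaseChecks(long places, int chunks) {
        return (long) chunks * levels(places, chunks);
    }

    public static void main(String[] args) {
        System.out.println(levels(1_000_000, 10));          // 6
        System.out.println(worstCaseChecks(1_000_000, 10)); // 60
        System.out.println(levels(1_000_000, 2));           // 20
        System.out.println(worstCaseChecks(1_000_000, 2));  // 40
    }
}
```

Under these assumptions, a million candidate locations split ten ways at a time need six levels and at most 60 checks. The total grows with the logarithm of the system’s size, not with the size itself.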

Wrapping it up

Following these steps will inevitably lead to the solution. In your career, you will often be surprised at which interface fails you! Sometimes the filesystem will be full, causing a logger to hang. Sometimes /etc/host.conf will no longer be resolving a domain name. Very occasionally, the compiler itself will have failed you. Anything and everything can and will break once in a while. Regardless of the problem, if you are following these steps, you will find a root cause, or at least the reason why you can’t find a root cause (e.g. you don’t have proper permissions to test a network interface).