A Discipline of Errors

In a previous life, I had to write a discipline of exceptions for the users of a library I had built for C++ games.

In this incarnation, I’m working in Go, and need to do something of the same sort, so here’s my pitch: Some errors are typos by the user.  Others are things they’ve said properly, but which the program can’t do. Still more are things that could be tried, but didn’t succeed. And finally, there are things that blew sky-high, about which none of us can do much. We have to handle them all.

Typos, Spellos and Thinkos

These are the usual kind of error: the user asked for something we could tell was impossible, so we say so. In an interactive program we then loop the user back to the input form and ask them to try again. In my previous life, I threw an exception that restarted the step at the form-entry point.

In a command-line program, we can call log.Fatalf() and let the user re-edit the command-line.  In a batch program taking its input from a control (“.ini”) file we can do the same thing: the program can quickly fail, and the cron or other job-control program will report that to the user. They can then fix the problem.
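To make the command-line case concrete, here is a minimal sketch of failing fast on a bad option. The -workers flag, its 1 to 64 range and the validateWorkers helper are all invented for the illustration; only flag, fmt and log.Fatalf come from the standard library.

```go
package main

import (
	"flag"
	"fmt"
	"log"
)

// validateWorkers checks a user-supplied value before any work starts.
// The name and the 1..64 range are made up for this example.
func validateWorkers(n int) error {
	if n < 1 || n > 64 {
		return fmt.Errorf("-workers must be between 1 and 64, got %d", n)
	}
	return nil
}

func main() {
	// Hypothetical option; the default is valid, so a bare run succeeds.
	workers := flag.Int("workers", 4, "number of workers (hypothetical option)")
	flag.Parse()

	// A typo is the user's to fix: say exactly what was wrong and exit non-zero.
	if err := validateWorkers(*workers); err != nil {
		log.Fatalf("bad command line: %v", err)
	}
	fmt.Printf("starting with %d workers\n", *workers)
}
```

Keeping the check in its own function means the same validation can be reused for a control file, and log.Fatalf stays in main where exiting the process is clearly intended.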

A daemon is a bit trickier: the daemon-manager will try to restart it a few times, give up and notify a human. The human will have to read logs to see what failed and what they should change. That person may initially be the user, but it will usually be you, so you need a good log message to remind you what the heck this was.

The border between cron and daemons is where things change for command-line programs. Immediate failure under cron causes immediate notification of the user, usually via email. Immediate failure in daemons, not so much. The user only gets delayed notification. Then they call you.

As we get farther and farther from an interactive program towards a long-running batch or big-data job, things get harder.

Syntax correct, Semantics not so much

Other problems are things they’ve said properly, but which the program can’t do.

If we find out quickly, it’s the same as a typo: we report, the user fixes it, and away we go. If we don’t, then after some time an interactive program shudders to a halt, explains what just happened and invites the user to start again.

In my previous life this was an exception type that had to be sure it was completely cleaned up, so that it could restart the step all over again without causing a fiasco. That isn’t always easy. Sometimes it’s impossible, and in that case, you had to treat it as the next case and exit.

In a command-line program, it’s an ordinary log.Fatalf(with a good message), so the user can figure out what to fix.

In a batch or daemon program, what to do is not so clear. If the program has a detailed runbook, the sysadmin who gets the failure report from the daemon-starter can fix it and restart.  Of course, not all batch/daemon programs come with runbooks. That means they’re going to call the author, and you get to figure it out. While the user fumes.

This is the first point at which you may want to call panic(fmt.Errorf(something meaningful)) instead of log.Fatalf(). An unrecovered panic prints a stack trace along with your message, and that extra context might help you remember what the heck caused the issue.
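As a sketch of what "something meaningful" could look like, the example below panics with an error that names the operation, the inputs and the limit, and wraps an underlying sentinel error with %w. The names mustReserve and errQuotaExceeded, the tenant and the 100 GB limit are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuotaExceeded stands in for a request that was well-formed but that
// the program cannot carry out. It is a hypothetical sentinel error.
var errQuotaExceeded = errors.New("quota exceeded")

// mustReserve panics with a wrapped error that records what was being
// attempted, so the message and stack trace tell the whole story.
func mustReserve(tenant string, gigabytes int) {
	const limit = 100 // hypothetical limit, in GB
	if gigabytes > limit {
		panic(fmt.Errorf("reserve %d GB for tenant %q (limit %d GB): %w",
			gigabytes, tenant, limit, errQuotaExceeded))
	}
}

func main() {
	mustReserve("acme", 10) // within the limit, so no panic
	fmt.Println("reserved")
}
```

Because the panic value is an ordinary error built with %w, anything that does recover it can still use errors.Is to tell a quota problem from other failures.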

Try and Fail, slowly.

The next harder case is something that runs for a significant time and then fails. Even for an interactive program, the best approach is to shut down and leave as much information as you can for a postmortem. If you’re writing a library, panic-and-recover is probably viable.
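For the library case, one common shape of panic-and-recover is to let internal helpers panic with a private type and have the exported function turn that back into an ordinary error. This is only a sketch under that assumption: Parse, mustCount and parseError are invented names, not part of any real library.

```go
package main

import "fmt"

// parseError is a private panic value, so the library can tell its own
// failures apart from genuine bugs. Hypothetical type for this sketch.
type parseError struct{ msg string }

// Parse is the exported entry point of an imaginary library. Internal
// helpers may panic with parseError; Parse converts that into an error
// and re-panics on anything it does not recognise.
func Parse(input string) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			pe, ok := r.(parseError)
			if !ok {
				panic(r) // not ours: a real bug, let it crash
			}
			err = fmt.Errorf("parse %q: %s", input, pe.msg)
		}
	}()
	return mustCount(input), nil
}

// mustCount is an internal helper that gives up by panicking.
func mustCount(s string) int {
	if s == "" {
		panic(parseError{"empty input"})
	}
	return len(s)
}

func main() {
	fmt.Println(Parse("abc"))
	fmt.Println(Parse(""))
}
```

The caller never sees a panic for an expected failure, while an unexpected one still blows up with its stack trace intact for the postmortem.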

Cron jobs will then tell the person who scheduled them, and daemons will try, try and try again, possibly in vain. A good daemon-manager is a joy here.

Debugging

On the other hand, there is a case where extremely minor problems should panic. When you’re developing a program, a “can’t happen” error should panic, even if it’s a case where you can recover. For example,

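x, err := someFunction()
if err != nil {
    if crash {
        // Development and testing: stop here so the logic error gets diagnosed.
        panic(err)
    }
    // Production: recover with a safe default.
    x = 0
}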

When running in production, recover. When testing, set a -crash option and diagnose anything that shouldn’t happen. If something shouldn’t happen and does, you have a logic error.

The same applied in C++: most of my exception handlers would crash-stop whenever I was developing or testing. And that includes during regression tests under Jenkins.

A final consideration when you’re first putting a program with recovery code into production: you may want to write a log entry right after the x = 0 above, to say what failed. If you do, it’s good to log it at a level that you can turn on and off in production. If you don’t often use “info”, that might be a good setting. You can turn it on for the first few weeks of production, decide if your recovery strategy is appropriate, and then turn it off.
