Errors, as if they were in a safety-critical system

In some systems, errors are really evil. Trains running into cars at railroad crossings, for example.

In computer programs, errors usually aren’t that serious, but what if we looked at them from the viewpoint of someone building safety-critical systems?

Many moons ago, I attended Jonathan Ostroff’s course on safety-critical systems, which studied things like crossing gates not closing before the trains showed up. What if we look at an everyday Go library as if it were a safety-critical system and ask what that teaches us?

First, draw a DFA.

A really simple library might look like this:

[Figure: first DFA diagram]

In this diagram, everything inside the oval is “good”, and everything outside better get into a good state darned quick. Imagine everything inside the oval is a train by itself on a track and everything outside is a pickup truck crossing three feet in front of Mr Train. Hurry, pickup, hurry!

What If?

Now let’s consider what happens if one of the functions fails.

If Open() fails, it’s easy: we never get into the “good” state, and the programmer, the operator or the daemon who called the library finds out quickly. This handles problems like programmer errors and a lack of resources nicely. Halt and catch fire; someone other than the library will decide what to do (;-))

If Close() fails, it’s a bit harder. The good case is “I don’t care”, and the program can exit, optionally logging a message saying there was a bug if it wasn’t something we expected.
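Here's a minimal sketch of those two cases in Go. Conn, Open, Close and run are all hypothetical stand-ins of mine, not any real library's API: a failed Open is returned to the caller right away, while a failed Close is only logged.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Conn is a hypothetical handle; closeErr lets us simulate a failing Close.
type Conn struct{ closeErr error }

// Open is a stand-in for the library's Open(); it fails on an empty name.
func Open(name string) (*Conn, error) {
	if name == "" {
		return nil, errors.New("open: empty name")
	}
	return &Conn{}, nil
}

// Close is a stand-in for the library's Close().
func (c *Conn) Close() error { return c.closeErr }

// run returns an error if Open fails, because we never reached the good
// state. A Close failure is only logged, since the work is already done.
func run(name string) error {
	c, err := Open(name)
	if err != nil {
		return fmt.Errorf("cannot start: %w", err)
	}
	// ... Write() calls would go here ...
	if err := c.Close(); err != nil {
		log.Printf("close failed (possible bug): %v", err)
	}
	return nil
}

func main() {
	fmt.Println(run(""))     // Open fails, so the caller hears about it
	fmt.Println(run("data")) // the normal path
}
```

The caller of run decides whether a failed Open means exit, alert or something else; the library only reports it quickly.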

If Write() fails, though, we have a bunch of time-consuming problems to consider. Time-consuming, because we no more want the program to hang than we want a pickup truck on the tracks.

If a programmer has passed gibberish to Write(), we want to respond really quickly, just like a failed Open(), so we never send the proverbial train out onto the tracks.

If the data is bad, we have a different problem with a different time-scale. Personally, I look to see if I can ignore it and continue. If I can’t, then that’s another blog post. When I figure it out, I promise to post it!

If the infrastructure that’s supposed to implement the writing is bad, I can do one of three things.

  1. I can retry with exponential backoff
  2. I can close and re-open, then try to write, or
  3. I can throw up my hands and ask for a human to figure it out.

As a DFA, this is

[Figure: second DFA diagram]

  • The first is retry, the green loop around the Write() call, preferably with exponential back-off so you don’t beat the stuff below the library to death.
  • The second is panic and recover, to force a close and open before you try again.
  • The third is “punt”. Ask a human.

In a real safety-critical system, you’d do all three.

What do you really do?

As I said, all three.

Start with retry. If the retry is taking too long, panic and recover, so you can start all over again. If that doesn’t work, stop immediately. You may already have squished the truck at the level crossing.

The real consideration is how long things take.

  • If there’s lots of time, earmark part of it for a panic/recover and use the remainder for retry. If the retry fails, stop and call a human.
  • If time is tight, see if you can retry a few times, then stop and call a human.
  • If there’s no time, stop and call a human.

Not doing the wrong thing is a good plan. Waking me up in the middle of the night to decide is also part of a good plan. I just don’t particularly enjoy it (;-))

–dave

[See also https://leaflessca.wordpress.com/2018/04/28/a-discipline-of-errors/ which looks at this problem from a different angle]
