Try, try again (without catch)

In a previous life on clunky old individual machines, I used to have to semi-manually run steps with careful checks in between and stop if anything didn’t work. Gee, now I have clusters, which have gazillions of machines, all of which I have to apply a series of steps to, and stop if anything goes wrong. Again.

This used to be the bane of a sysadmin’s existence: a long list of commands, each of which you need to run and look at carefully. Scripts are part of the answer, but there is a very high chance that a step will fail, and that you’ll have to debug the script while unbreaking a machine. Alas, those lists are back. Sometimes they never even left.

A modern example, with git.

It doesn’t have to be a cluster. Even sequential things can be hard, if they’re fine-grained like git.
Imagine you need to rebase a branch on master. The manual sequence is

git fetch --all --prune 
git checkout master 
git pull --ff-only 
git checkout branch 
git pull --ff-only 
git rebase --preserve-merges origin/master 
git push --force-with-lease origin branch

You type them in order, from memory, and decide after each step whether you want to continue. If you do it wrong, It Will Be Bad. [The example above contained an error, as I had typed it from memory. Fixed.]

Conversely, if you script it straightforwardly, most of the code will be checks between steps. When it fails, and it often fails, you start from the middle, cut-and-pasting out stuff to run.

Make does it better

To make, each step is just a dependency of the one after it: when a step fails, you can continue with the back end of the list by saying make again and pasting in the uncompleted steps. Of course, make, APL and Perl are write-only languages: not fun for anyone other than the original author.

Ansible is better than make, as it’s more structured, but you have to design your DSL carefully if anyone other than you is to use it.

Or write a toy language

For git, I have a (shell) language with two extra verbs, trySimple and tryRebase. trySimple just tries the command, and if it fails, exits the program. tryRebase tries a rebase, and if it fails, rolls back the rebase with git rebase --abort, and then exits the program.
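A minimal sketch of what two such verbs might look like as plain shell functions. The function bodies and the wording of the messages are assumptions for illustration, not the actual script; only the names trySimple and tryRebase and the use of git rebase --abort come from the description above.

```shell
# Hypothetical sketch of the two verbs as POSIX shell functions.

# trySimple: run the command given as arguments; if it fails,
# report it and exit the whole script with the command's status.
trySimple() {
    "$@" || {
        status=$?
        echo "FAILED (exit $status): $*" >&2
        exit "$status"
    }
}

# tryRebase: same idea, but roll back a half-finished rebase
# with "git rebase --abort" before exiting.
tryRebase() {
    "$@" || {
        status=$?
        echo "FAILED (exit $status): $*; aborting the rebase" >&2
        git rebase --abort
        exit "$status"
    }
}
```

Because the failure branch hangs off `||`, the functions should behave the same whether or not the calling script uses `set -e`.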

The entire rebase becomes

trySimple git fetch --all --prune
trySimple git checkout ${MASTER}
trySimple git pull --ff-only
trySimple git checkout ${branch}
trySimple git pull --ff-only
tryRebase git rebase --preserve-merges origin/${MASTER}
if [ "$FAKE" = "no" ]; then
        echo "OK to push --force? (^C to abort) "
        read junk
fi
trySimple git push --force-with-lease origin ${branch}

Once I had that, I added an option (-fake) to just print out the steps, so that if they weren’t idempotent and I exited mid-run, I could conveniently cut and paste the remaining ones.
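One way such a dry-run option could work, sketched under assumptions: a FAKE variable (the rebase script tests "$FAKE" = "no", so "yes" is assumed to mean dry run) is set by a leading -fake argument, and the verb prints the step instead of running it. The option parsing and the function body are hypothetical.

```shell
# Hypothetical sketch of a dry-run mode for trySimple.
FAKE=${FAKE:-no}

# Assumed convention: a leading "-fake" argument switches to dry run.
case "${1:-}" in
    -fake) FAKE=yes; shift ;;
esac

trySimple() {
    if [ "$FAKE" = "yes" ]; then
        # Dry run: print the step so it can be cut and pasted later.
        # Note that "$*" drops the original quoting of the arguments.
        echo "$*"
        return 0
    fi
    "$@" || {
        status=$?
        echo "FAILED (exit $status): $*" >&2
        exit "$status"
    }
}
```

With FAKE set to yes, every trySimple line in a script prints one pasteable command and nothing is executed.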

The general case

Really, the primitives you need are

  • one for simple exit and
  • one for each case where you need to roll back before you exit.

Have those in the language you work in, such as shell functions in your .bashrc or your cluster-admin init script, and you can deal with “old fashioned” problems quite briskly.
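If writing one verb per rollback case gets repetitive, the rollback could be passed in as an argument. This is a sketch of that idea only: the name tryOrUndo and its calling convention are made up for illustration, and it uses eval, so the undo string must come from you and not from untrusted input.

```shell
# Hypothetical generic verb: tryOrUndo 'UNDO COMMAND' COMMAND [ARGS...]
# Runs COMMAND; if it fails, runs the undo string, then exits the script
# with COMMAND's exit status.
tryOrUndo() {
    undo=$1
    shift
    "$@" || {
        status=$?
        echo "FAILED (exit $status): $*; rolling back with: $undo" >&2
        eval "$undo"
        exit "$status"
    }
}
```

A rebase step would then read something like `tryOrUndo 'git rebase --abort' git rebase origin/${MASTER}`, and a simple step is just the case where there is nothing to undo.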

