Capacity Planning as Pain Relief

As you may know from my old blogs, I’ve often done capacity planning, and generally recommend it as a pain-avoidance tool.

However, I was just reading a blog (not by one of my customers!) about how much pain they went through when they didn’t have enough storage performance, and it struck me that it should take only about an hour to turn a pain point into a pain-avoidance plan. Here’s how.

Introduction

A company I follow recently decided that they should stay with cloud storage, which was interesting, but the most interesting part was what they said made them consider running their own storage: every time load got high, journal write times went from two seconds to forty or more.

[screenshot: journal write times under load]

Now, if you happen to be doing anything where human beings wait for you, forty seconds is bad. Really bad. Twenty to thirty seconds is about the limit of human short-term memory. After that long, many of us will have completely forgotten what we were doing. When I got back to it, I’d probably assume it had taken even longer, conclude “my supplier must have crashed”, and start wondering if this was another Amazon S3 crash.

You can imagine what kind of pain they were in!

Pain Avoidance

However, they also have graphs of the load over the same period, which means they can work out the one thing that will be immensely useful to them: how much their storage slows down under load.

In a similar but read-heavy scenario, I plotted read times against load, and got a scattergram with three distinct regions:

[scattergram: read response time vs. load in IOPS]

The first was below about 100 IOPS, where response times were quite low: only a relatively small number of requests arrived at the same instant as another and had to wait. Above 100 I/O operations per second, we started to see a lot of requests arriving at the same time and slowing each other down. By 120, we were seeing huge backups, with requests sitting in the queue for 30 seconds or more before they could get a chance to go to the disk.

Response time versus load always forms a “hockey-stick” curve, technically a hyperbola, and the data can be plugged into a queue modeller like pdq to get a good estimate (the solid line). If I had had a lot more data points at 110-140 IOPS, the scattergram itself would have shown a definite “_/” shape.
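
For the curious, here is a minimal sketch of where an estimate like that comes from, assuming the simplest possible queueing model (M/M/1: one disk serving one request at a time) and a made-up 8-millisecond service time. The numbers are purely illustrative, not the ones behind the scattergram, and pdq does a far more careful job, but the shape of the curve is the same:

// Hedged sketch: approximate the hockey-stick curve with an M/M/1 queue.
// The 8 ms service time and the IOPS range are illustrative assumptions,
// not the real numbers from the measurements above.
package main

import "fmt"

func main() {
        const serviceTime = 0.008 // assumed seconds per I/O

        for iops := 10.0; iops <= 120.0; iops += 10.0 {
                utilization := iops * serviceTime // fraction of the disk's capacity in use
                if utilization >= 1.0 {
                        fmt.Printf("%6.0f IOPS: saturated, the queue grows without bound\n", iops)
                        continue
                }
                // M/M/1 response time: service time divided by the idle fraction,
                // which is what produces the hyperbola.
                response := serviceTime / (1.0 - utilization)
                fmt.Printf("%6.0f IOPS: about %.0f ms per request\n", iops, response*1000)
        }
}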

This is the thing you need to avoid the pain: the slowdown curve. Once you know it, you can plan to avoid being at the wrong point on it.

Conclusion

If you have ever had a major slowdown, as the bloggers did with their journal writes, ask yourself: do you have the load from the same time period?

If you do, an ordinary spreadsheet will give you a scattergram of slowness versus load, and you can draw the hockey-stick curve by eye. Now you know what to avoid, and the pain you suffered has been turned into data that can help you never have the problem again.

Know that curve and resolve to avoid the bad parts, forevermore!

 

“DLL Hell”, and avoiding an NP-complete problem

Needing two incompatible versions of the same library can be an evil problem, and a recent paper pointed out that solving it is NP-complete.

“DLL hell” has been a problem in Windows and many of the free and commercial Unixes. Interestingly, it was recognized and fixed in Multics, but then reproduced in both Unix and Windows, and finally re-fixed in SunOS 5. More recently, it was fixed in part in Linux glibc.

This is the story of how.

Introduction

Last week, I had the pleasure of reading about Kristoffer Grönlund’s 2017 linux.conf.au talk, Package managers all the way down, about the recurring problems of package managers for both operating systems and languages.


DLL hell is back, and hitting multi-language projects like Hawk as well as language-specific tools like npm.

Russ Cox investigated the problem, and wrote a well-known paper that opens with “Dependency hell is NP-complete. But maybe we can climb out.”

The problem is that a program that uses libraries can end up needing two different versions of a subordinate library. For example, I might write a main program that uses glibc n+1, but from it call a library that requires glibc n.

If your system can’t load both, you can’t link.

History

In principle, you can design a linker that can load both versions: the Rust language and the Nix package manager have taken that approach, at the expense of managing a large and ambiguous symbol space.

An elegant idea, however, is to version-number the individual interfaces, and keep both old and new interfaces in the library. When you update a library, you get new interfaces and bug-fixes, but you also get the backwards compatibility of having the “old” interfaces in the same file. Of course, an old interface isn’t always a complete copy of the old code: often it’s a “downdater” that calls the new one with different parameters, or an “updater” that wraps the old code in a function with additional capabilities.
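
As a rough illustration of a downdater, here is a sketch in Go rather than in linker terms. The FormatV1/FormatV2 names and the extra width parameter are invented for the example; real libraries do this with versioned symbols rather than differently named functions:

// Hedged sketch of a "downdater": the names are invented for illustration.
package report

import "fmt"

// FormatV2 is the current interface: it grew a width parameter.
func FormatV2(name string, value, width int) string {
        return fmt.Sprintf("%-*s %d", width, name, value)
}

// FormatV1 is the old interface, kept so existing callers still build and run.
// It is a downdater: it calls the new code with the parameter that old
// callers never had, pinned to the old default.
func FormatV1(name string, value int) string {
        return FormatV2(name, value, 20) // old behaviour: a fixed 20-column layout
}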

The idea of having only one copy of a library on the system and keeping old versions of calls came originally from Multics: Paul Stachour wrote about their hardware and software versioning in You Don’t Know Jack About Software Maintenance. In that era, they did continuous delivery, something which required backwards compatibility: they just called it continuous maintenance.

On Multics, all maintained versions of an interface exist in the library, and it is the interfaces that have versions. If someone talks about version 7 of a library, they mean a library that includes the interfaces that were added for version 7.

On Unix, that degree of complexity was undesirable: the early Unixes did not link dynamically and the number of libraries was small, so it was hard to have the problem.

Systems with lots of shared libraries and products depending on them, however, made us aware of it once more. Windows gave us the name “DLL hell”, and we promptly recreated it in SunOS 4, as did the other Unix vendors. For SunOS 5, my team under David J. Brown then recreated the fix, and later discovered that the glibc team had done so too. David described both in a Usenix talk, Library Interface Versioning in Solaris and Linux.

We needed to do so because we had a “compatibility guarantee”. If your program ran and passed the “appcert” script, then we warranted it would run on all newer Solaris versions. If it didn’t, that was our fault, and ours to fix. Failing to fix DLL hell would have cost us significant money!

In Linux, Linus makes a similar guarantee: to not break running programs. It too is hard to keep, but manageable inside a mostly-static kernel of moderate size. It’s harder in the application world.

How it Works

Assume my main program and a library both call memcpy. My program makes sure that the source and target don’t overlap, so it can use the default memcpy. However, the library does do overlapping copies, blithely assuming that memcpy does pure left-to-right copying. That’s no longer true, so the library is linked to a specific memcpy, memcpy@GLIBC_2.2.5, which does make that guarantee. That interface is “downdated” to do strict left-to-right copying by being turned into a call to memmove instead of to the newest memcpy.

To ensure this, the compiler and linker have to be aware of the versions required, and insert the equivalent of “.symver memcpy, memcpy@GLIBC_2.2.5” into the linkage process, using the rules described in David J. Brown’s paper to choose a version old enough to guarantee the needed behavior.

Understanding the implications

Paul Stachour’s paper describes how versioning is used in an across-the-net API, similar to modern web development, to achieve continuous delivery.

David J. Brown’s paper addresses managing many variations over time, and allowing some degree of new-runs-on-old, to avoid forcing developers to keep old compilers around when doing bug-fixes to older versions of a program.

To me, the value during development is huge. We typically had a number of services in development or in use, each with an API and several clients. We would only be working on one of the clients at a time, and wanted to avoid “flag days”, where the team had to stop and change the API of the server and all of its clients before we could continue. Those were horrible.

Instead, our process became

  • put the new version of the API on the server, and add a downdater for the old one,
  • when we next update each client, change it to use the new version,
  • when all clients use the new version, remove the downdater.
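
A hand-wavy sketch of that process for a small HTTP service, with every path and payload invented for illustration: version 2 of an endpoint goes up alongside a version 1 downdater, and the downdater is deleted once the last client has moved.

// Hedged sketch of the process above for an HTTP API; the paths and
// payloads are invented, not from any real service we ran.
package main

import (
        "fmt"
        "log"
        "net/http"
)

// statusV2 is the current handler: clients now say which unit they want.
func statusV2(w http.ResponseWriter, r *http.Request) {
        unit := r.URL.Query().Get("unit")
        if unit == "" {
                unit = "iops"
        }
        fmt.Fprintf(w, "load=97 unit=%s\n", unit)
}

// statusV1 is the downdater: it serves the old URL by calling the new
// handler with the parameter old clients never sent. It gets deleted
// once the last client has moved to /v2.
func statusV1(w http.ResponseWriter, r *http.Request) {
        r.URL.RawQuery = "unit=iops" // old behaviour: always report IOPS
        statusV2(w, r)
}

func main() {
        http.HandleFunc("/v2/status", statusV2)
        http.HandleFunc("/v1/status", statusV1) // remove when all clients use /v2
        log.Fatal(http.ListenAndServe(":8080", nil))
}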

The only oddity you’d notice was that some versions didn’t ever ship, but would be consumed internally.  You’d see versions like 4.1.3 followed by 4.1.9 if you used one of our team’s libraries.

That was not a solution to every linking problem: we still had the “normal” problem that each time we got a new release of a library we call directly, we found any breaking changes the first time we compiled. But that’s the same read-the-release-notes problem we had before, so we never needed to worry about “vendoring” or “dependency management” in our development.

Conclusions

 

Today, we’re once again faced with the incompatible-versions problem, in both language- and OS-based package managers.

This time, however, we have the opportunity to see two working implementations of a solution. From those, we can take the best of their ideas and properly solve this recurring problem.

Forwards and Backwards Pipes in Go

In the Go language, an often-asked question is “how do I tell if all the goroutines are done?”  I asked myself that recently, while porting a measurement library to use channels (pipes).  As the channels went to goroutines, how would I know when they were done and that I could exit?

The common answer I found was to use a semaphore, a sync.WaitGroup: increment it before starting each goroutine and decrement it at the end of the goroutine. The main program can then wait for it to drop to zero and only then exit. Exiting any earlier would kill the goroutines, discarding any work they hadn’t finished.
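
A minimal sketch of that answer, with the work the goroutines do invented purely for illustration:

// Hedged sketch of the WaitGroup idiom; the squaring "work" is made up.
package main

import (
        "fmt"
        "sync"
)

func main() {
        var wg sync.WaitGroup
        results := make(chan int, 3) // buffered so the workers never block

        for i := 1; i <= 3; i++ {
                wg.Add(1) // increment before starting the goroutine
                go func(n int) {
                        defer wg.Done() // decrement when this goroutine finishes
                        results <- n * n
                }(i)
        }

        wg.Wait() // only now is it safe for main to read everything and exit
        close(results)
        for r := range results {
                fmt.Println(r)
        }
}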

However, some of the elegant examples in books and talks didn’t include waitgroups. For example, Lexical Scanning in Go by Rob Pike didn’t use anything special, and neither did a tiny lexer and parser I had copied from his talk.

Instead, my code started a lexer/parser goroutine and then looped, consuming the output, something like:

go run(l) // runs the lexer/parser and closes l.Pipe when it finishes
for tok := range l.Pipe { // consume tokens until the pipe is closed
        slice = append(slice, tok)
        if tok.Typ == token.EOF {
                break // the lexer is done; the range would also end when the pipe closes
        }
}
return slice

The idiom here is to make the last step in the pipeline part of the mainline, and consume the material created in the pipe. In effect I’m building the pipeline protruding “backwards” from main().

I can’t always structure library code this way: in a number of cases I have needed the pipe to proceed “frontwards” from the mainline, and there I do indeed use sync.WaitGroup.
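
For contrast, here is a sketch of that “frontwards” arrangement, again with invented work: main starts the producers, and a WaitGroup tells a small closer goroutine when the pipe can safely be shut.

// Hedged sketch of a "frontwards" pipeline: main feeds goroutines and a
// WaitGroup decides when the output pipe can be closed. The doubling
// "work" is invented for illustration.
package main

import (
        "fmt"
        "sync"
)

func main() {
        out := make(chan int)
        var wg sync.WaitGroup

        for i := 1; i <= 3; i++ {
                wg.Add(1)
                go func(n int) {
                        defer wg.Done()
                        out <- n * 2 // each producer writes into the shared pipe
                }(i)
        }

        // Close the pipe only when every producer is done, so the consuming
        // loop below knows when to stop.
        go func() {
                wg.Wait()
                close(out)
        }()

        for v := range out {
                fmt.Println(v)
        }
}

The extra goroutine exists only to close the channel; without it, the consuming range loop in main would never terminate.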

However, when I can, I like the idea of writing go this(); go that(); go theOtherThing() and then reading the results in the mainline. It’s elegant.

The “Modem Testing” Problem

I’m a happy user of vagrant, and over Christmas I was setting up a new set of instances for work, using Ubuntu under KVM.  It’s always desirable to have a target VM that exactly matches your production environment, so your testing matches where you’re going to use your work.

Sure enough, it failed.

I know this works under Ubuntu, but my dev machine is running Fedora 25, and that combination probably hasn’t been tested. I can run VMs via VirtualBox, which works fine, but it’s not exactly the same as production.

What has that to do with modems?

Well, in a previous life, I used to work with a company that did V.32bis components for modems, and they called any “A doesn’t work with B” problem a modem testing problem.

When modems were fairly new and there were dozens of manufacturers, each manufacturer tested their newest release against their previous release, and then perhaps against an old copy of a well-known competitor’s product, and released it into the wild.

At which point their customer-support line lit up with new customers complaining that their newly bought modem didn’t work with their ISP any more.

The usual answer was “tell your ISP to use our product instead of brand X”.

This was not particularly helpful: their ISP usually said “don’t use anything we haven’t tested, and we’ve only tested brand X”.

That’s exactly what happened here: the combination of vagrant, KVM, Ubuntu and Fedora hasn’t been tested, and, depending on the Ubuntu version, may have a failing NFS or may not have networking at all.

This is by no means unusual: tons of products only work if you get the exact configuration that the developers selected.  That’s part of why I always have an exact copy of production in the first place!

What do you do about it?

If you’re an individual, file bugs. That’s the best thing you can do to help the implementors.  Giving them a test sequence that they can add to their automated regression tests is a friendly act, so they won’t ever make the same mistake again. Be open to testing their fixes, and don’t just label them bozos (;-))

If you’re a contributor, have a matrix of stuff you need to at least smoke-test against. Making friends with beta-testers and bug reporters can give you a huge list of platforms that have been debugged, and you might talk them into running a cron job that makes them part of your nightly build-and-test farm.

You’re not going to have to solve the “every possible brand of modem” problem if you can find a community of people who care enough to report bugs and test fixes. You don’t actually have to test against everything, just the things people actually use. And if people depend on your software, they’re motivated to want it to run well. Recruit them!