One of the most elegant ways to performance-test a program is to replay actual work that it did in the past. In these days of “everything is REST over HTTP”, it’s also one of the easiest.
Once upon a time, a customer of mine needed to replace a storage array (the same one I wrote about filling, earlier). They needed to know how it was performing in order to plan for its replacement.
The old array had a web server proxying for it, and was running the default logging. We took one look and congratulated ourselves: it recorded every transaction, whether it was a GET, PUT, POST or DELETE, and the full path to the file. The only thing we lacked was the response time of each transaction, and that was an optional item that either nginx or Apache could log.
We added " $request_time" to the end of the log lines and started collecting, during a week the business assured us was typical. We eventually selected a two-hour period with a perfectly normal range of loads, and extracted the logs.
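For nginx, that change amounts to defining a log format that is the stock "combined" format with $request_time appended (a sketch; the format name and log path here are illustrative):

```nginx
# "combined" plus response time; the name "timed" is arbitrary.
log_format timed '$remote_addr - $remote_user [$time_local] '
                 '"$request" $status $body_bytes_sent '
                 '"$http_referer" "$http_user_agent" $request_time';

access_log /var/log/nginx/access.log timed;
```

Apache can log the same thing by adding %D (microseconds to serve the request) to a LogFormat line.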
One Stone, Three Birds
With this we can do at least three things:
- see whether a future system will handle the load, with real file sizes and repetition rates so we can measure caching too,
- know what load and performance we have now, and its variability, and
- have good data to load-test a future system.
The first need is to be able to tell whether a planned future system can handle the load.
Let’s pretend that I planned to use a small server with the same extra-quiet 3600/7200 rpm disks as my backup server at home. To find out what load a single disk will handle, I only need to replay the log from the old machine on the new one, stepping up the load in TPS (transactions per second, or requests per second) and recording how long each transaction takes at a given load.
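The replay driver doesn’t have to be fancy. Here is a minimal sketch in Python of the pacing loop: fire one request every 1/rate seconds and record how long each takes. The `fake_request` stub stands in for a real HTTP client call against the system under test; names and the stub’s timing are mine, not the tool actually used.

```python
import threading
import time

def replay(requests, rate, do_request):
    """Issue `requests` at `rate` per second; return each one's latency."""
    latencies, lock, threads = [], threading.Lock(), []
    interval = 1.0 / rate

    def worker(req):
        start = time.monotonic()
        do_request(req)
        elapsed = time.monotonic() - start
        with lock:
            latencies.append(elapsed)

    for req in requests:
        t = threading.Thread(target=worker, args=(req,))
        t.start()
        threads.append(t)
        time.sleep(interval)  # pace submissions at the target rate
    for t in threads:
        t.join()
    return latencies

# Stand-in for a real HTTP GET against the test server:
def fake_request(path):
    time.sleep(0.001)

lats = replay(["/files/a.jpg"] * 20, rate=100, do_request=fake_request)
print(len(lats), sum(lats) / len(lats))
```

Stepping `rate` up run by run, and averaging the returned latencies at each step, gives the load-versus-response-time curve described below.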
On my machine, named Calvin, the performance curve looks like this. The X axis is load in requests per second, and the Y axis is response time. In this diagram, higher is worse, as what we want is short response times (i.e., fast responses).
The average response time hits 0.5 seconds at about 50 requests/second, which is roughly the response time the old system delivered at its usual load of 200 requests/second.
For my server to hit 200 TPS at 0.5s response time, I’d need 4 disks of the same type. Maybe five, to be safe.
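That sizing step is just the ratio of the two measured numbers, rounded up:

```python
import math

old_load_tps = 200   # the old array's usual load
one_disk_tps = 50    # where one test disk reached 0.5 s response time

disks_needed = math.ceil(old_load_tps / one_disk_tps)
print(disks_needed)  # 4, before adding any safety margin
```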
This is exactly the process my customer used to confirm that their proposed servers would be able to do the job:
- measure the load and performance directly from the logs, then
- replay the observed load in the lab, and measure how many disks we’ll need to get the required performance.
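The first of those steps can be sketched in a few lines of Python. The regex below assumes the "combined" format with $request_time appended, as described earlier; the exact field layout is an assumption, so adjust it to your own log_format.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed layout: nginx "combined" with $request_time appended.
LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'\d+ \d+ "[^"]*" "[^"]*" (?P<rt>[\d.]+)'
)

def summarize(lines):
    """Return (requests per one-second bucket, overall mean response time)."""
    per_second = defaultdict(int)
    times = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        per_second[m.group("ts")] += 1  # log timestamps have 1 s resolution
        times.append(float(m.group("rt")))
    return dict(per_second), (mean(times) if times else 0.0)

sample = [
    '10.0.0.1 - - [12/Mar/2015:10:00:01 +0000] "GET /files/a.jpg HTTP/1.1" '
    '200 5120 "-" "curl/7.5" 0.012',
    '10.0.0.2 - - [12/Mar/2015:10:00:01 +0000] "PUT /files/b.jpg HTTP/1.1" '
    '201 0 "-" "curl/7.5" 0.034',
]
counts, avg = summarize(sample)
print(counts, round(avg, 3))
```

The per-second counts give the observed load in TPS; the response-time column gives the performance baseline the new system has to beat.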
As it happens, we had planned for lots of disks in order to store lots of data, so we knew we would have better performance than the old system at the same load. And we could take more load without ever being slower than the old system. A win both ways.
What about “Normal Overload”?
Of course, more load is what always happens, especially if the company is successful. What happens if we grossly overload the disks?
Well, we can find that out by replaying the recorded load faster and faster, until the disk “hits the wall”. When it does, it will show a hockey-stick curve when we plot load versus response time.
This is what Calvin shows when we run the load up well past anything sane. Sure enough, at around 200 TPS, we see a sharp turn upwards, a classic “_/” curve:
You wouldn’t plan to run those kinds of loads, but it’s a good thing to know that a disk won’t crash at 250 TPS. It might be stupidly slow, but it will continue to work, as will the Linux OS it is running on.
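Spotting that turn in the numbers doesn’t even need a graph. A crude sketch: flag the first load level where mean response time blows past some multiple of the low-load baseline. The threshold factor and the sample points below are invented to match the shape described, not measured data.

```python
def knee(samples, factor=3.0):
    """Return the first load where latency exceeds `factor` x baseline."""
    baseline = samples[0][1]
    for load, latency in samples:
        if latency > factor * baseline:
            return load
    return None

# (load in TPS, mean response time in seconds): made-up numbers with the
# flat-then-vertical shape of a hockey-stick curve.
curve = [(50, 0.5), (100, 0.6), (150, 0.8), (200, 2.5), (250, 9.0)]
print(knee(curve))  # 200
```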
Whether it’s a disk drive or a computation, logs tell us a lot about performance, not just activity. And these days, they’re dead easy to get.