On performance projects, I’m usually looking for bugs or bad decisions that have cost my customers orders of magnitude of performance. This week, I needed to quantify the value of 0.01 second.
A customer had an old program that responded to a REST query in a tenth of a second and a brand-new version that responded in 0.09 seconds. The team lead was wondering if the new version had paid for itself.
As soon as she told me how many copies of it she had, I told her to stop worrying. She’d just put off their every-three-years hardware uplift for a year. A year’s interest on the capital cost of a machine room was a lot more than her team’s salary. In fact, I said she should probably ask for the team to be recognized by management for their good work.
How do you get there from here?
If you plug the two programs into a model-builder like PDQ, you get a graph of response time against load. The old program, “project tortoise”, took 0.1 second to complete, which was particularly handy, as I can do the calculations for that response time in my head. There were many users making requests of a roomful of multi-core machines, and the time between any two requests from a user averaged out to about a second.
Plug that into PDQ, and you get the blue line, a “hockey-stick” curve (really a hyperbola) that starts out nice and flat, but curves evilly upward. Upward is slower in this graph of response time, and therefore worse.
Plug the 0.09 second program in, and you get the orange line, which we’ll call “project hare” (that’s not its real name). As you can see, the hare starts off faster, doesn’t start to slow as soon, and gets farther and farther ahead as the program is put under more and more load.
For planning purposes, the company used 0.2 seconds as an upper bound. That’s a good choice, as the response time usually doubles around the point that something in the system is getting close to being a bottleneck. In this case it’s a disk array, and is observed to be at 80% load whenever the response time is averaging 0.2 seconds. After 80%, the array really starts to slow down, so we want to stay away from that point.
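To get a feel for the shape of those curves without PDQ, here is a minimal sketch that assumes a single open M/M/1 queue, where response time is R = S / (1 - utilization). This is a simplification and not the PDQ model behind the graph, so its numbers will not match the per-core figures in this article. The function names `response_time` and `max_rate` are made up for illustration.

```python
def response_time(service_time, arrival_rate):
    """Mean response time of an open M/M/1 queue: R = S / (1 - utilization)."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")  # saturated: the queue grows without bound
    return service_time / (1.0 - utilization)


def max_rate(service_time, response_limit):
    """Largest arrival rate that keeps R at or below response_limit."""
    return (1.0 - service_time / response_limit) / service_time


# Print the two hyperbolas side by side: load, tortoise (0.10 s), hare (0.09 s).
for tps in range(0, 10):
    tortoise = response_time(0.10, tps)
    hare = response_time(0.09, tps)
    print(f"{tps:2d} TPS  tortoise {tortoise:.3f} s  hare {hare:.3f} s")

# Load each program can carry before crossing the 0.2 second planning limit.
print(max_rate(0.10, 0.2), max_rate(0.09, 0.2))
```

Even in this toy model, the faster program starts lower, bends upward later, and reaches the 0.2 second limit at a higher load, which is the behaviour the graph shows.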
Project hare has given us an extra 1.25 TPS per core at a point where the tortoise could handle only about 7.5 TPS per core. That works out to be about 16.7% better, and the service’s growth rate is only about 15% per year. Because the gain is larger than a year’s growth, they just added a whole year to the working lifetime of the machines.
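The arithmetic behind that claim can be checked in a few lines. This is a sketch that assumes load grows by a steady 15% compounded annually, which is my reading of the growth figure rather than something the customer stated. The variable names are illustrative.

```python
import math

tortoise_tps = 7.5   # per-core throughput at the 0.2 s limit (from the text)
extra_tps = 1.25     # additional per-core throughput from project hare
growth = 0.15        # assumed compound annual growth in load

# Relative capacity gain from the faster program.
gain = extra_tps / tortoise_tps

# Years of compound growth that the extra capacity can absorb:
# solve (1 + growth) ** years == 1 + gain for years.
years = math.log(1 + gain) / math.log(1 + growth)

print(f"capacity gain: {gain:.1%}")
print(f"years of growth absorbed: {years:.2f}")
```

Under that assumption the result comes out a little over one year, consistent with pushing the hardware uplift back by a year.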
Many hands make work light
The improvement didn’t look like much, but because it was applied to a large number of machines with an even larger number of disks, it paid off in cubic yards of money saved.
Having many, many processors doesn’t just make work light, it makes every little advance pay off wonderfully.
You could even say, “many hands pay you many times”.