As you may know from my old blogs, I’ve often done capacity planning, and generally recommend it as a pain-avoidance tool.
However, I was just reading a blog (not by one of my customers!) about how much pain a team went through when they didn’t have enough storage performance, and it struck me that it should take them about an hour to turn that pain point into a pain-avoidance plan. So this is how.
A company I follow recently decided to stay with cloud storage, which was interesting, but the most interesting part was what made them consider running their own storage: every time load got high, their write-journal times went from two seconds to forty or more.
Now, if you happen to be doing anything where human beings wait for you, forty seconds is bad. Really bad. Twenty to thirty seconds is the timeout point for human short-term memory: after that long, many of us will have completely forgotten what we were doing. With me, I’d probably assume it had taken even longer, conclude “my supplier has crashed”, and start wondering if this was another Amazon S3 outage.
You can imagine what kind of pain they were in!
However, they also have graphs of the load at the same time, which means that they can calculate one value that will be immensely useful to them: how much their storage slows down under load.
In a similar but read-heavy scenario, I plotted read times against load, and got a scattergram with three distinct regions:
The first was below about 100 IOPS, where response time stayed quite low, as relatively few requests arrived at the same instant as another and had to wait. Above 100 I/O operations per second, we started having a lot of requests arriving at the same time and slowing each other down. By 120, we were seeing huge backlogs, with requests sitting in the queue for 30 seconds or more before they got a chance to go to the disk.
Response times versus load always form a “hockey-stick” curve, technically a hyperbola, and the data can be plugged into a queue modeller like PDQ to get a good estimate (the solid line). If I had had a lot more data points at 110-140 IOPS, the scattergram would have shown a definite “_/” shape.
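As a sketch of what a queue modeller computes, here is the simplest case: a single M/M/1 queue, where mean response time is R = S / (1 − λS), with S the service time and λ the arrival rate. The 8 ms service time below is a hypothetical number I picked so the knee lands near the 100-120 IOPS region described above; PDQ handles far richer models than this.

```python
def mm1_response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 queue: R = S / (1 - U),
    where utilisation U = arrival_rate * service_time."""
    utilisation = arrival_rate * service_time
    if utilisation >= 1.0:
        raise ValueError("queue is saturated: utilisation >= 1")
    return service_time / (1.0 - utilisation)

# Hypothetical disk with an 8 ms service time (S = 0.008 s):
# the hockey-stick knee appears as load approaches 1/S = 125 IOPS.
for iops in (50, 100, 110, 120):
    print(f"{iops:3d} IOPS -> {mm1_response_time(iops, 0.008) * 1000:6.1f} ms")
```

Running it shows the hyperbola in miniature: response time roughly triples between 50 and 100 IOPS, then explodes as you close in on saturation, which is exactly the “_/” shape in the scattergram.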
This is the thing you need to avoid the pain: the slowdown curve. Once you know it, you can plan to avoid being at the wrong point in it.
If you have ever had a major slowdown, as the bloggers did with their journal writes, ask yourself: do you have the load from the same time period?
If you do, an ordinary spreadsheet will give you a scattergram of slowness versus load, and you can draw the hockey-stick curve by eye. Spreadsheets will fit exponentials for you, but that’s nowhere near accurate enough: your eye will do better.
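If you’d rather script the spreadsheet step, here is a minimal sketch: it takes paired (load, response-time) samples, groups them into load buckets, and averages the response time in each bucket, which is the same summary your eye extracts from the scattergram. The sample numbers are invented to mimic the shape described above.

```python
from collections import defaultdict

def slowdown_table(samples, bucket_width=10):
    """Group (load_iops, response_ms) samples into load buckets and
    average the response time in each bucket."""
    buckets = defaultdict(list)
    for load, response in samples:
        key = int(load // bucket_width) * bucket_width
        buckets[key].append(response)
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

# Hypothetical samples: (IOPS, response time in ms), low load to overload.
samples = [(45, 12), (52, 14), (95, 35), (104, 48), (118, 900), (122, 30000)]
for load, avg in slowdown_table(samples).items():
    print(f"{load:3d}-{load + 9:3d} IOPS: avg {avg:8.1f} ms")
```

Reading down the table, the knee is obvious: response times crawl up gently until about 100 IOPS and then go vertical, and the bucket where that happens is the part of the curve you resolve to stay out of.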
Now you know what to avoid, and the pain you suffered has been turned into data that can help you never have the problem again.
Know that curve and resolve to avoid the bad parts, forevermore!