A previous customer needed to replace a program before a 32-bit counter turned over, a simple task… However, the 2^32 links pointed to 2^32 large files, so suddenly we were doing a large-data migration.
Before the new service can be used, we have to do all of the following:
- get the current working set onto the new service
- arrange to get the rest, the stuff that isn’t being used this moment, but will be
- don’t overload the link, or it will cost us money, and
- allow fail-back when the new system has teething troubles.
This would be easy if the new service were a database; databases do this all the time. So we blatantly copy what a DBA would do.
We start out populating the new service with the working set of the old. In part, this is so we can do like-with-like comparisons and tests before we risk the business on it. Load tests, as I mentioned before in this blog series, are a good way of not planning to fail.
It looks like this:
We start pulling from old to new, where old is the master:
For every request to the old system, we capture the text of the GET from the access log and ship it to the new system, where we replay it. The new system sees it lacks the image and requests it from the old system, which typically returns it instantly from its cache.
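As a sketch of that replay loop (the host name and the combined-log format here are assumptions on my part, not details of the real system):

```python
import re
import urllib.request

# Hypothetical address for the new service.
NEW_HOST = "http://new-service.internal"

# Assumes an Apache/nginx "combined"-style access log; adjust to taste.
GET_RE = re.compile(r'"GET (\S+) HTTP/[0-9.]+" 200\b')

def gets_from_log(log_lines):
    """Extract the paths of successful GETs from access-log lines."""
    paths = []
    for line in log_lines:
        m = GET_RE.search(line)
        if m:
            paths.append(m.group(1))
    return paths

def replay(paths):
    """Replay each GET against the new system; a miss there makes the
    new system pull the file from the old one, warming its store."""
    for path in paths:
        urllib.request.urlopen(NEW_HOST + path).read()
```

Tail the old system's log into `gets_from_log`, hand the result to `replay`, and the working set migrates itself.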
For every upload to the old system, we do much the same:
We rewrite the PUTs from the access log into requests and replay them on the new system, causing another transfer from the old system's cache to the new.
At this point, we are starting to collect a representative sample of real work on the new system, and can start running validation tests on it. For one thing, we can replay the last few minutes' activity from the old system, knowing that we have the files locally, and measure how fast the new system is with real work.
Of course, just having the working set isn’t enough. Now we have to get all the other files, too.
To do that, we can take a list of files from the old system, any time after the working-set transfer has started, and know that when we are done transferring them, we’ll have everything.
Of course, this is where the network bandwidth comes in. We have to feed the list of files to the new system at a controlled and strictly moderate rate. A load tester does this surprisingly well: it's just unusual to set one to produce one TPS instead of ten thousand.
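A one-TPS feeder is little more than a paced loop. A minimal sketch, with the actual HTTP GET abstracted into a `fetch` callable so the pacing is the whole point:

```python
import time

def feed(paths, fetch, tps=1.0):
    """Replay `paths` against the new system (via `fetch`) at a fixed
    rate, so the cross-site link is never saturated."""
    interval = 1.0 / tps
    for path in paths:
        start = time.monotonic()
        fetch(path)  # e.g. an HTTP GET against the new system
        # Sleep off whatever the request itself didn't use up.
        time.sleep(max(0.0, interval - (time.monotonic() - start)))
```

Set `tps` to whatever the link budget allows; a real load tester just has this knob turned down instead of up.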
After a few centuries, both systems are in sync. If we list the files on both, they’ll match. Well, except for the very last few, which are probably still in flight.
Now we start turning things around backwards.
We start writing to the new system, and as soon as we get a new file, we write it to the old system. We still have the PUTs being sent from old to new, so we can check to make sure the images have arrived, and pick up any metadata that the old system still uses, such as the 32-bit identifier that we’re running out of.
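Sketching that reversed direction (every name here is a hypothetical stand-in for the real clients):

```python
def put_and_mirror(path, data, put_new, put_old, old_metadata, record_id):
    """Write to the new (now master) system, copy the file back to the
    old one, and pick up the legacy metadata the old system assigns,
    such as the 32-bit identifier we were running out of."""
    put_new(path, data)
    put_old(path, data)
    meta = old_metadata(path)
    record_id(path, meta["id"])
    return meta
```

The old-to-new PUT replay is still running underneath this, which is what lets us confirm each mirrored file actually arrived.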
Finally we start reading from the new system, and the process is complete.
The old system is still running, at least until it runs out of IDs, and it’s in sync with the new, so we can fail over to it any time we need to do additional work on the new, or if there’s a fiasco in the new data centre.
We just did a single-master, fail-over “database” using little more than scripts.
The reasons this was possible were:
- both old and new systems used REST, and
- both supported fetching a file if it was absent, as that was our mechanism for reloading from our archives when a file was damaged or updated
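That second property is the read-through behaviour the whole scheme leans on. A toy model of it (the class and callable names are mine, not the systems'):

```python
class ReadThroughStore:
    """Toy model of the fetch-on-miss behaviour both systems shared:
    a GET for a missing file triggers a fetch from an origin (the old
    system, or the archive) before answering."""

    def __init__(self, origin_fetch):
        self.files = {}                  # path -> bytes held locally
        self.origin_fetch = origin_fetch  # callable: path -> bytes

    def get(self, path):
        if path not in self.files:
            # Miss: pull the file from the origin, keep it, serve it.
            self.files[path] = self.origin_fetch(path)
        return self.files[path]
```

Point the new store's `origin_fetch` at the old system and every replayed GET doubles as a transfer.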
Not every system in the world fits this model, but the pattern does fit: if we know how to solve this problem in the database world, we can solve it in a different world.
Mathematicians call this a “reduction”. If you figure out how to turn a new problem into an old, solved one, you’ve solved the new one. QED!