Time Budgeting

If you’re concerned about the performance of your code, there is a classic performance-engineer’s approach to managing risk, called a “time budget”.

If you’re inside your budget, you can do capacity planning to get enough CPU and memory resources, at some price. But if you aren’t, throwing money and developer time at a fuzzy performance problem won’t give you any guarantees.

Introduction

The time budget is how long a single-threaded transaction takes on a uniprocessor. If you ensure your code stays within it, you can calculate how fast it will run on a multiprocessor with a known clock speed.

The check itself is just a unit test, part of the suite of tests I use to validate the other functional attributes of my programs.

On my replay load tester, Play it Again, Sam, it looks like this:

2020/07/26 07:19:23 main_test.go:86: Get took 0.100412 seconds, within 101ms
PASS
ok github.com/davecb/Play-it-Again-Sam/cmd/runLoadTest 0.104s

Once the run time is measured and budgeted for, a load test will show you how close you can come to it on your production machines, and how much limited CPU or memory will interfere.

How to do it

As soon as you have a block of code that accepts work, write a single-request test framework that calls the module and times how long it takes. In Go, that’s easy: it’s just a unit test.

Then write a dummy service that the code calls, that does nothing except wait 100 milliseconds.

Run the test framework, and see if you get a single transaction completed in 100 milliseconds plus the time you’ve budgeted.

Let’s say the code should take one millisecond. If the time-budget test takes less than 101 milliseconds, STOP! Premature optimization is the root of all evil.


If the transaction previously took 100.6 milliseconds, but just jumped to 103, then run the profiler and ask what just changed.

My worked example

Let’s say my total budget to read a record, run the load-test and return a result is one millisecond. The test looks like this when run:

$ go test
#yyy-mm-dd hh:mm:ss latency xfertime thinktime bytes url rc op offered
2020-07-26 14:54:13.491 0.100134 0.000000 0 0 /download/images/15b00a26-9ba3-4649-8477-c48bcab90dc7_180_1000_False_xqualityx.jpg 200 GET 1 
2020/07/26 14:54:13 runLoadTest.go:410: #date      time         name        pid  utime stime maxrss inblock outblock
2020/07/26 14:54:13 runLoadTest.go:411: 2020-07-26 14:54:13.491 RunLoadTest 21185 0.001673 0.002502 23552000 0 0
2020/07/26 14:54:13 main_test.go:42: Get took 0.100518 seconds, within 101ms
PASS
ok  	github.com/davecb/Play-it-Again-Sam/cmd/runLoadTest	10.106s

It’s implemented as a simple test for elapsed time:

// budgetTest sees if we can complete quickly enough.
// budgetedTime is a package-level time.Duration holding the budget.
func budgetTest(debug bool, t *testing.T) {
    initial := time.Now()
    systemUnderTest(debug)
    totalTime := time.Since(initial)
    if totalTime >= budgetedTime {
        // over budget: fail the test
        t.Errorf("Get took %f seconds, more than %v",
            totalTime.Seconds(), budgetedTime)
    } else {
        log.Printf("Get took %f seconds, within %v",
            totalTime.Seconds(), budgetedTime)
    }
    time.Sleep(10 * time.Second) // let pipes drain
}

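The span from initial := time.Now() to time.Since(initial) measures the total elapsed time of the action. Whatever exceeds the dummy service’s 100 milliseconds is the code’s own overhead, and that is what gets compared against the budget.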

In Play it Again, Sam, the dummy service is a “protocol” different from REST or S3. This one is a dummy communications protocol that waits exactly 100 milliseconds and then returns. Any time taken above the 100 milliseconds is overhead in the load-testing program, the very thing we want to measure.

// Get does a GET that should take one tenth of a second
func (p timeBudgetProto) Get(path string, oldRc string) {
    initial := time.Now()
    // wait a tenth of a second
    time.Sleep(100 * time.Millisecond)
    latency := time.Since(initial)
    transferTime := 0 // the dummy protocol transfers no data
    reportPerformance(initial, latency, transferTime,
        []byte(""), path, http.StatusOK, oldRc)
    close(alive) // This forces an immediate exit
}

Subtleties

It’s easy to make the test fail: running it with verbose logging is sufficient, as I/O is more expensive than computation. When experimenting with old code where I don’t have a performance target, I often set the budget so that I just pass with --verbose turned off, but fail with verbose logging turned on.

It’s also easy to get it to fail by running set-up code inside the timed part. That makes it sensitive, for example, to whether the file system cache is “warm” with any data that the program needs. In the case of Sam, that’s a .csv file of input data. Running a number of other tests before a time budget test can help you avoid false positives, but refactoring your code so you can start timing after initialization is better.

Finally, if you’re going to run this in a CI/CD system, you’ll want an externally settable budget, and you’ll want to arrange for the CI step to always run on a machine with known performance. Setting up your cgroups or your CI server so the test always has the entirety of a single CPU core and enough memory will keep a budget test from turning into a source of false positives. Requiring that it pass N tries out of M is also good, if your CI system supports such a thing.

Conclusions

Making a budget and taking steps when you exceed it is good advice for humans. I recommend it for your programs, too.
