Measurement Heisenbugs

One of my smarter colleagues once said “your metrics are making the program impossibly slow. Turn some off”

Introduction

I initially thought he was quoting the line from Amadeus, where the Emperor tells Mozart there are too many notes in his music, and to take some out.

In fact, he was diagnosing a bug that our metrics package suffered from: the load on the machine rose with the number of metrics collected, not with the number of queries for them. Metrics collection therefor slowed down the program it was observing, a Heisenbug!

My background is with Solaris and Linux kernel stats, so I was flabbergasted. Metrics should only use the time of the person querying them, never the time of the program that the metrics are measuring.

Kernel Stats and “stat” tools

You may notice that tools like vmstat, iostat and mpstat expect you to provide an interval between samples. If you don’t, they return the average since the last reboot.

That’s because the proper way to measure things like user CPU time is to fetch the current values, wait one second and then fetch the new values. The vmstat program then reports on the difference, in this case user CPU time, during one second.

Kernels are performance-sensitive, and they carefully collect metrics in a way that does not slow them down. For example, the dispatcher collects user and system time in unlocked per-CPU counters. If the counters were shared across multiple CPUs, they would have to be locked to keep them consistent, and every use would cause a flurry of cache-line and memory hits, dragging the performance of the operating system into the dust.

When a user program like vmstat asks for them, it makes a system call and the kernel reads all the per-CPU counters, still unlocked, and returns them. The kernel does have to be a bit careful about getting good values, but most counters are 64 bits, and easy to read without returning a value that’s in the middle of being changed.

Vmstat then waits, repeats its call, computes the difference and formats that for the human observer.

The result is that the kernel always runs at full speed, incrementing counters as it goes. The total cost varies linearly with the number of counters, and is pleasantly low. So low, in fact, that Solaris could implement per-process nanosecond-level measures of cpu-queue residency. That requires a metric update every single time the dispatcher changes any process’ state. Rather often.

The kernel never pays the cost of rolling up the stats: that’s something the observing program should do. A system call or a read of /proc uses a bit of kernel time to return the metrics, but the hard work happens in the program that asked for them.

Doing the same in Go

Let’s say we want to report requests, errors and response time for REST queries to a Go program, the common “RED” metrics. We also want to do so without slowing down the Go program.

Se we structure the code into two parts. One is an additional “/metrics” entry point that reports the current totals,

    r = r.Now()
    return fmt.Sprintf("%d, %d, %fs", 
        r.Requests, r.Errors, r.Duration.Seconds())

The cost to the program is a bit high, because it uses fmt.Sprintf() and time.Seconds(), but if we’re not constantly calling it, is acceptable. Once a minute sounds entirely reasonable. Maybe a bit less reasonable if we call it every 10 seconds when we’re debugging something, but then we’re expecting the Heisenberg effect.

The second part is collecting the data. We add a “red” struct to a convenient struct, initialize it with the current time in a hidden field and update it in our server.

func (s *Server) doSomething(req *Request) *Response {
    s.red.Requests++
    err := prepareJSON(&req)
    if err != nil {
        log.Printf("bad request: unable to prepare json: %s", err)
        s.red.Errors++
        return nil
    }
    ...

Now when the /metrics entrypoint is called, it will report the number of request and errors since the program started, and the Now() function will do a time.Since() to update the Duration.

If we then copy vmstat to create a “redstat” program, we can do substantially what the operating system does: update metrics at negligible cost, and have a separate program pay that cost, and then only when we want to view or plot those stats.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s