Measuring Apples and Oranges

Sometimes you really do have to measure two very different things. Recently I needed to compare an old batch program with a new Golang one, to see how much we’d improved.

Introduction

The problem sounded simple, but the batch program was multi-process, while the Go program was multi-threaded. Hundreds of batch children would come into existence, run to completion and die before the Go program exited.

When I looked at process accounting with dump_acct, there were hundreds of records for batch, but none from the Go program yet: an accounting record is only written when a process exits, and the Go program was still running.

And the whole problem was “racy”. If I tried to measure a 10-minute period using /proc calls, almost all the batch children that were running when I started would have exited. By then they’d have been replaced with a whole new collection of batch children that I wouldn’t be measuring.

Measuring

The best answer would be to add measurement code to both programs, but I only had an afternoon.

The second-best was to take a snapshot of the two programs for a much shorter time and note how many batch children ran versus how many exited.

A snippet of my spreadsheet looked like this:

name    pid    cputime  totals
golang  900    0.77     0.77
batch   31651  exited   71.47   rows    102
batch   31672  exited           exited  22
batch   31876  exited
batch   31874  exited
batch   31696  exited
batch   21521  3.5
batch   31887  exited
batch   31642  exited
batch   31680  exited
batch   31982  0.93
batch   31877  exited
batch   31698  exited
batch   32193  0.86
batch   32062  0.9
batch   32055  0.87

102 children were running when I started measuring, and thirty seconds later when I finished, 22 had exited. Probably another 22 had started up, for a likely margin of error of 44/102, or about 43 percent.

However, the batch program used many times as much CPU as the Go program, so even with the maximum margin of error, we knew we were on the right path.
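
The arithmetic behind that claim can be written out as a small Go sketch. The figures are the ones from the spreadsheet snippet; the function names are made up for illustration, and applying the whole margin as a reduction of the batch total is my assumption about what "maximum margin of error" means, not something the measurement tool does.

```go
package main

import "fmt"

// marginOfError is the fraction of sampled children that may be
// mis-measured: those that exited during the window plus an assumed
// equal number that started during it. (Hypothetical helper.)
func marginOfError(exited, assumedStarted, rows int) float64 {
	return float64(exited+assumedStarted) / float64(rows)
}

// worstCaseRatio shrinks the batch CPU total by the full margin before
// comparing it with the Go program's CPU time. This is an assumed,
// deliberately pessimistic reading of the error.
func worstCaseRatio(batchCPU, goCPU, margin float64) float64 {
	return batchCPU * (1 - margin) / goCPU
}

func main() {
	const batchCPU, goCPU = 71.47, 0.77 // totals from the spreadsheet

	m := marginOfError(22, 22, 102)
	fmt.Printf("margin of error:  %.1f%%\n", m*100)
	fmt.Printf("measured ratio:   %.1fx\n", batchCPU/goCPU)
	fmt.Printf("worst-case ratio: %.1fx\n", worstCaseRatio(batchCPU, goCPU, m))
}
```

With these numbers the measured ratio is roughly 93 to 1, and it stays above 50 to 1 even when the whole margin is counted against the batch program.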

Tooling

The program we used was tiny, written in a few hours using one of the many /proc libraries available for Go.

A loop read the names of the programs, searched /proc for their pids, and started a goroutine for each pid.

Each goroutine took a measurement, slept for 30 seconds and then reported.
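
The loop and goroutines described above might look something like the sketch below. It is not the code from the repository: it reads /proc directly with the standard library instead of using a /proc library, it assumes the measurement is the change in user plus system CPU ticks from /proc/<pid>/stat, and all the function names are invented for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"sync"
	"time"
)

// parseCPUTicks returns utime+stime, in clock ticks, from the text of
// /proc/<pid>/stat. The command name sits in parentheses and may contain
// spaces, so fields are counted from the last ')'.
func parseCPUTicks(stat string) (uint64, error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("no command name in stat line")
	}
	f := strings.Fields(stat[end+1:])
	if len(f) < 13 {
		return 0, fmt.Errorf("stat line too short")
	}
	// f[0] is field 3 (state), so utime (14) and stime (15) are f[11], f[12].
	utime, err := strconv.ParseUint(f[11], 10, 64)
	if err != nil {
		return 0, err
	}
	stime, err := strconv.ParseUint(f[12], 10, 64)
	if err != nil {
		return 0, err
	}
	return utime + stime, nil
}

// readTicks reads and parses /proc/<pid>/stat; it fails once the pid is gone.
func readTicks(pid int) (uint64, error) {
	b, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, err
	}
	return parseCPUTicks(string(b))
}

// pidsByName scans /proc for processes whose comm matches name.
func pidsByName(name string) []int {
	var pids []int
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		comm, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", pid))
		if err == nil && strings.TrimSpace(string(comm)) == name {
			pids = append(pids, pid)
		}
	}
	return pids
}

func main() {
	const interval = 30 * time.Second
	var wg sync.WaitGroup
	for _, name := range os.Args[1:] {
		for _, pid := range pidsByName(name) {
			wg.Add(1)
			go func(name string, pid int) {
				defer wg.Done()
				before, err := readTicks(pid)
				if err != nil {
					fmt.Printf("%s\t%d\texited\n", name, pid)
					return
				}
				time.Sleep(interval)
				after, err := readTicks(pid)
				if err != nil || after < before { // gone, or pid reused
					fmt.Printf("%s\t%d\texited\n", name, pid)
					return
				}
				fmt.Printf("%s\t%d\t%d\n", name, pid, after-before)
			}(name, pid)
		}
	}
	wg.Wait()
}
```

Run it with the program names as arguments; each line of output is a name, a pid, and either a tick count or "exited", ready to paste into a spreadsheet. Two caveats: the kernel truncates the name in /proc/<pid>/comm to 15 characters, and the counts are in clock ticks (commonly 100 per second on Linux), so they need converting before they can be read as seconds.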

All the serious work was in the analysis of error: knowing how racy the situation was, could we find a period that was representative enough, but short enough that it didn’t see the death of too many children?

A good value turned out to be 30 seconds, half the typical life of a batch child (i.e., we were sampling at twice the frequency of the thing being measured).

The program is available on GitHub, at https://github.com/davecb/sampleProc, and it’s a good start at measuring apples and oranges. For roundness, not color.
