Detecting anomalies sounds simple: all you have to do is define what’s normal, and look for what isn’t. Of course, exhaustively defining what’s normal could be just a little hard.
When working as a sysadmin I keep having to detect business anomalies. Alas, the usual tools sysadmins have available aren’t designed for that. Graph packages will alert you if a graph falls to zero or exceeds a gazillion, but that’s not what I need to do.
Imagine I’m running EBay. To me, an anomaly is a vendor whose sales are cut in half. Sales falling to zero at midnight isn’t an anomaly, and neither is their rising and falling through the day. Like most businesses, their sales rise and fall in a regular pattern during the day, are especially low (or high!) on weekends, and can spike or crater suddenly when a major sporting event attracts the attention of all their customers, depending on whether they sell sports gear or not.
My job is to detect anomalies, and then find out if we’ve caused them, such as by rolling out a buggy release. If we have caused them, we need to do a rollback. If we haven’t caused them, we need to alert the business, so they can call the customers and ask if they’ve rolled out a buggy release.
If I don’t have a good way to detect anomalies, then I’m forced to fall back to the traditional accountant’s tactic of comparing today’s business with the same day last week. That’s a good tactic, but it usually involves looking at hourly or daily samples of data. Hourly data is often delayed by several hours by the “big data” processes used to roll it up to usable samples, and daily data can be delayed for … a day.
The Tools Problem
If you look for anomaly detection programs, you’ll be buried in a cascade of products, heavily featuring machine learning, intended to learn what’s normal and what isn’t.
Developers and researchers have produced classification and clustering systems, heuristic approaches using “nearest-neighbor” analyses, and theoretical ones using information theory. However, one of the very best and easiest to understand is statistical. And old.
It’s the “Western Electric” rules, written by the company Bell created to manufacture telephones. They were created to detect problems that their quality-control systems couldn’t otherwise handle, and are considered one of the classic works in statistical quality control.
Western Electric’s Solution
Start with an average and standard deviation, and look at the data points as they arrive. If one of them is more than three standard deviations below the average, that’s an anomaly. A rather glaring one, in fact.
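The test itself is tiny. Here’s a minimal sketch in Go — the function name and the sample numbers are mine, purely for illustration:

```go
package main

import (
	"fmt"
	"math"
)

// threeSigma reports whether x lies more than three standard
// deviations from the mean -- the most basic Western Electric rule.
func threeSigma(x, mean, stddev float64) bool {
	return math.Abs(x-mean) > 3*stddev
}

func main() {
	// Hypothetical hourly sales figures: mean 100, stddev 10.
	fmt.Println(threeSigma(95, 100, 10)) // within limits: false
	fmt.Println(threeSigma(65, 100, 10)) // 3.5 sigma below: true
}
```

Everything interesting lies in where the mean and standard deviation come from, which is what the rest of this article is about.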
It’s also not what normal monitoring tools alert on. They detect 100% CPU or 0% memory, because that’s all their authors think sysadmins need to watch for.
Borrowing a diagram from Wikipedia, a point that is “three sigma below the mean” usually means something’s wrong, such as the point in the lower right corner of the diagram above.
In this diagram, the region labeled “A” is the range between two and three standard deviations above or below the average. As you can see, the last sample is well below the lower “A”.
Statistically, 99.73% of “normal” data will fall within three standard deviations of the average. Only 0.27% won’t.
In turn, that means that this data point is something I should pay attention to. And yes, it also means that statisticians have defined what normal is: they assume a bell curve, and call it the “normal distribution”.
In my EBay example, this would be something like selling a year-old pickup truck for $100. Not impossible, but definitely anomalous.
Extending it to Groups
The Western Electric rules also consider groups of points, in order to detect streams of points approaching the limits:
- two points out of three beyond +/- 2 sigma, on the same side of the average, and
- four points out of five beyond +/- 1 sigma, also on the same side
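The group rules above amount to counting how many recent points fall beyond a limit, on the same side of the average. Here’s one way that might look in Go — a rough sketch of my own, with window sizes and messages chosen for illustration, not the code from my repository:

```go
package main

import "fmt"

// weRules checks the most recent points against the Western Electric
// rules, given a mean and standard deviation describing "normal".
// It returns an empty string when nothing is anomalous.
func weRules(points []float64, mean, sd float64) string {
	// Rule 1: a single point beyond 3 sigma.
	if n := len(points); n > 0 {
		last := points[n-1]
		if last > mean+3*sd || last < mean-3*sd {
			return "rule 1: point beyond 3 sigma"
		}
	}
	// Rule 2: two of the last three points beyond 2 sigma, same side.
	if countBeyond(lastN(points, 3), mean, 2*sd) >= 2 {
		return "rule 2: 2 of 3 beyond 2 sigma"
	}
	// Rule 3: four of the last five points beyond 1 sigma, same side.
	if countBeyond(lastN(points, 5), mean, sd) >= 4 {
		return "rule 3: 4 of 5 beyond 1 sigma"
	}
	return ""
}

// lastN returns the trailing n elements (or fewer, if not available).
func lastN(p []float64, n int) []float64 {
	if len(p) < n {
		return p
	}
	return p[len(p)-n:]
}

// countBeyond counts points more than limit above the mean, and points
// more than limit below it, returning the larger count so that the
// "same side" condition holds.
func countBeyond(p []float64, mean, limit float64) int {
	var above, below int
	for _, x := range p {
		switch {
		case x > mean+limit:
			above++
		case x < mean-limit:
			below++
		}
	}
	if above > below {
		return above
	}
	return below
}

func main() {
	// Two of the last three points sit above mean + 2*sd (120).
	pts := []float64{101, 125, 103, 127}
	fmt.Println(weRules(pts, 100, 10)) // prints "rule 2: 2 of 3 beyond 2 sigma"
}
```

A real filter would run this on a sliding window as each new point arrives, rather than on a fixed slice.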
In the EBay case, this could be a vendor whose daily sales were growing (good!) or falling off abruptly.
Once we see an anomaly, we first make sure we didn’t cause it, and then reach out to the vendor to see if they are aware of it, and if there is anything we can do to help.
Adapting it to Changing Data
In classic WE, we add the new points to the average, probably daily, and then break it up into convenient sets of samples for plotting by hand on a piece of graph-paper.
If an engineer in the graph-paper-and-pencil era was looking at a problem in real-time, they would probably draw a graph for the previous day, and then start adding points to the right-hand side. Of course, that means they’re assuming that the average wasn’t changing over time.
A better approach would be to recompute the average over the previous day each time you add a new data point. That’s painful by hand, but then we don’t plan to do it by hand.
Instead of using a fixed sample to compute averages and standard deviations from, we can use a moving average. There is a good algorithm for computing those, due to Welford and described in Knuth (Vol. 2, 3rd ed., p. 232).
How many samples to use when calculating averages and standard deviations will depend on how much your data varies during the day.
Mine, for example, tends to look like this:
There is far less business at night, and a sine-wave pattern overall. I need to be careful not to use so long an averaging period that I end up comparing nighttime to daytime data.
There is a good trick, though, for avoiding making it too short. I chose this particular day because it had a spike at 2:20 PM, which we knew was a problem. If it doesn’t show up in my results as an anomaly, then I know I’ve shortened my sample period too much!
If your data is something you can take a meaningful average and standard deviation from, apply the WE rules. I once had to write them in COBOL (don’t ask), so they’re not hard.
I’m using Go these days, so I wrote a filter in Go at https://github.com/davecb/WesternElectric, to read a stream of data and produce data for a plot. You can feed it to a package like Grafana or Datadog and see exceptions like this: