Health and Status Monitoring

November 29, 2009

Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon web site, or Google’s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.

As digital systems grow larger and more complex, monitoring their health and status becomes very challenging. Yet monitoring is the first step in detecting potential problems, identifying their root causes, and taking appropriate preventive action. These systems can contain thousands of different data feeds, data flows and processes, and a problem with just one of them can interrupt payments, ads or status updates. Often there are hourly, daily, weekly and seasonal variations in the data that complicate the detection of problems.

One way to gain some insight into this problem is to look at the origins of statistical quality control in the 1920s. Walter Andrew Shewhart (1891 – 1967) was an engineer at the Western Electric Company, which manufactured hardware for the Bell Telephone Company, from 1918 to 1924. From 1925 to 1956 he was a member of the technical staff of Bell Telephone Laboratories [ASQ].

One of the problems that concerned him was identifying potential problems in factory assembly lines. For example, the dimensions and weight of metal parts sampled from an assembly line can be recorded. He distinguished between two types of variation in these measurements:

  • A common cause of variation (or noise) occurs as a normal part of the manufacturing process.
  • A special cause of variation is not part of the normal manufacturing process, but represents a problem.

One of the goals of statistical quality control is to distinguish between these two types of variation and to quickly identify special causes of variation.

Shewhart introduced control charts as a tool for distinguishing between common and special causes of variation. A control chart has a central line and upper and lower control limits. When a measurement falls above the upper control limit or below the lower control limit, it is treated as a potential special cause of variation and investigated. Usually, the upper and lower control limits are set three standard deviations above and below the mean.
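A control chart of this kind can be sketched in a few lines of Python. This is only an illustration, not Shewhart's original procedure: the function names, the baseline measurements and the new measurements are all made up, and the limits are simply the sample mean plus or minus three sample standard deviations of the baseline.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits estimated from baseline measurements."""
    center = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation
    return center - k * sigma, center, center + k * sigma

def out_of_control(measurements, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(measurements) if x < lower or x > upper]

# Hypothetical part weights recorded while the process was known to be healthy.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0]
lower, center, upper = control_limits(baseline)

# Hypothetical new measurements; values outside the limits are flagged
# as potential special causes of variation.
new_parts = [10.0, 10.1, 11.5, 9.9, 8.2]
flagged = out_of_control(new_parts, lower, upper)
print(flagged)
```

In practice the limits would be estimated from a much longer baseline, and a flagged point is only a prompt for investigation, not proof of a problem.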

As anyone who has investigated potential data quality problems knows, identifying the root causes of potential problems is not easy. Shewhart also introduced a four-step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle or the Plan-Do-Check-Act Cycle:

  • Plan. Identify an opportunity or potential problem and make a plan for improving it or changing it.
  • Do. Implement the change on a small scale and collect the appropriate data.
  • Check. Use the data to analyze the results of the change statistically and determine whether it made a difference.
  • Act. If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again.

These same ideas are still used today as the basis for health and status monitoring systems. Well-designed digital systems are now built from the ground up to produce appropriate log data. Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data. Often this data is available through an HTTP interface and is continually collected.

[Figure: a dashboard from the open source Augustus system for health and status monitoring.]

Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor]. Instead of building a single model, a separate model is built for each cell in a multi-dimensional cube of models [Bugajski]. Instead of looking at the charts each day, an online dashboard is used that sits at the hub of an operations center.
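To make the idea of a change detection model concrete, here is a minimal sketch of a two-sided tabular CUSUM in Python. It is not the Augustus implementation; the function name, the parameter names (target, slack, threshold) and the data are assumptions chosen for illustration. The statistic accumulates deviations from the target beyond a small slack, so a modest but sustained shift eventually raises an alarm.

```python
def cusum_alarms(values, target, slack, threshold):
    """Two-sided tabular CUSUM; return the indices at which an alarm is raised.

    target    : expected level of the process
    slack     : deviation from the target that is tolerated as noise
    threshold : an alarm is raised when either cumulative sum exceeds this
    """
    high = 0.0  # accumulates evidence of an upward shift
    low = 0.0   # accumulates evidence of a downward shift
    alarms = []
    for i, x in enumerate(values):
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            alarms.append(i)
            high = low = 0.0  # restart accumulation after an alarm
    return alarms

# Hypothetical event counts: stable around 100, then drifting upward.
counts = [100, 101, 99, 100, 102, 103, 104, 103, 104, 105]
alarms = cusum_alarms(counts, target=100, slack=1.0, threshold=5.0)
print(alarms)  # [6, 9]
```

The slack and threshold trade detection delay against false alarms, and in a real deployment they would be tuned on historical data.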

Baseline and change detection models for each cell in a multi-dimensional data cube of models can be built easily using the open source Augustus system.
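The cube-of-models idea can be illustrated without any particular library. The sketch below is plain Python, not the Augustus API, and the dimensions (feed name and hour of day), the record layout and the numbers are all hypothetical. It keeps one baseline (mean and standard deviation) per cell, so a value that is normal in one cell can still be flagged in another.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baselines(records):
    """Group (dim_1, ..., dim_n, value) records by cell and fit one baseline per cell."""
    cells = defaultdict(list)
    for *dims, value in records:
        cells[tuple(dims)].append(value)
    # A cell needs at least two observations to estimate a standard deviation.
    return {cell: (mean(v), stdev(v)) for cell, v in cells.items() if len(v) >= 2}

def is_anomalous(baselines, cell, value, k=3.0):
    """Flag a value more than k standard deviations from its cell's baseline mean."""
    center, sigma = baselines[cell]
    return abs(value - center) > k * sigma

# Hypothetical history: (feed, hour of day, transactions observed in that hour).
history = [
    ("payments", 9, 1000), ("payments", 9, 1020), ("payments", 9, 980),
    ("payments", 3, 100), ("payments", 3, 110), ("payments", 3, 90),
]
baselines = build_baselines(history)

# The same count is unremarkable at 9:00 but far from the baseline at 3:00.
print(is_anomalous(baselines, ("payments", 9), 1010))  # False
print(is_anomalous(baselines, ("payments", 3), 1010))  # True
```

Segmenting by cell is one way to cope with the hourly, daily and seasonal variation that would otherwise swamp a single global model.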

References

[ASQ] ASQ, The History of Quality – Overview, retrieved from http://www.asq.org.

[Bugajski] Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke and Steve Vejcik, Data Quality Models for High Volume Transaction Streams: A Case Study, Proceedings of the Second Workshop on Data Mining Case Studies and Success Stories, ACM, 2007.

[Poor] H. Vincent Poor and Olympia Hadjiliadis, Quickest Detection, Cambridge University Press, 2008.