Health and Status Monitoring

Service interruptions in digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon website, Google’s Gmail, or the Visa payments network goes down even for a few minutes, it can make front-page news.

As digital systems grow larger and more complex, it becomes very challenging to monitor their health and status, which is the first step in detecting potential problems, identifying their root causes, and taking appropriate preventive actions. These systems can contain thousands of different data feeds, data flows and processes, and a problem with just one of them can interrupt services such as payments, ads, or status updates. Often there are hourly, daily, weekly and seasonal variations in the data that complicate the detection of problems.

One way to gain some insight into this problem is to look at the origins of statistical quality control in the 1920s. Walter Andrew Shewhart (1891–1967) was an engineer at the Western Electric Company, which manufactured hardware for the Bell Telephone Company, from 1918 to 1924. From 1925 to 1956 he was a member of the technical staff of Bell Telephone Laboratories [ASQ].

One of the problems that concerned him was identifying potential problems in factory assembly lines. For example, the dimensions and weight of metal parts sampled from an assembly line can be recorded. He distinguished between two types of variation in these measurements:

  • A common cause of variation (or noise) occurs as a normal part of the manufacturing process.
  • A special cause of variation is not part of the normal manufacturing process, but represents a problem.

One of the goals of statistical quality control is to distinguish between these two types of variation and to quickly identify special causes of variation.

Shewhart introduced control charts as a tool for distinguishing between common and special causes of variation. A control chart has a central line and upper and lower control limits. When a measurement falls above the upper control limit or below the lower control limit, it is considered a potential special cause of variation and is investigated. Usually, the upper and lower control limits are three standard deviations above and below the mean.
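The three-sigma rule can be sketched in a few lines of Python. This is only an illustration: the function names and the part weights are made up for the example, and the limits are estimated from a baseline sample that is assumed to contain only common-cause variation.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits estimated from baseline data."""
    center = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of the baseline
    return center - k * sigma, center, center + k * sigma


def out_of_control(measurements, lower, upper):
    """Return the indices of measurements that fall outside the limits."""
    return [i for i, x in enumerate(measurements) if x < lower or x > upper]


# Hypothetical part weights in grams, assumed to be "in control".
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.0]
lower, center, upper = control_limits(baseline)

# New measurements to check against the limits.
new_parts = [10.1, 9.9, 11.5, 10.0, 8.2]
flagged = out_of_control(new_parts, lower, upper)
print(flagged)  # [2, 4]
```

The two flagged parts are only candidates for a special cause of variation. Deciding whether they really are one still requires an investigation.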

As anyone who has investigated potential data quality problems knows, identifying the root causes of potential problems is not easy. Shewhart also introduced a four-step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle or the Plan-Do-Check-Act Cycle:

  • Plan. Identify an opportunity or potential problem and make a plan for improving it or changing it.
  • Do. Implement the change on a small scale and collect the appropriate data.
  • Check. Use the data to analyze the results of the change statistically and determine whether it made a difference.
  • Act. If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again.

These same ideas are still used today as the basis for health and status monitoring systems. Well-designed digital systems are now built from the ground up to produce appropriate log data. Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data. Often this data is available through an HTTP interface and is continually collected.

[Figure: a dashboard from the open source Augustus system for health and status monitoring.]

Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor]. Instead of building a single model, a separate model is built for each cell in a multi-dimensional cube of models [Bugajski]. Instead of looking at the charts each day, an online dashboard is used that sits at the hub of an operations center.
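To give a feel for how a change detection model differs from a control chart, here is a minimal sketch of a two-sided tabular CUSUM in Python. It is the textbook form of the technique, not necessarily what any particular monitoring system implements, and the parameter names (target, slack, threshold) and the data are assumptions made for the example.

```python
def cusum(values, target, slack, threshold):
    """Two-sided tabular CUSUM.

    Accumulates deviations from `target` that exceed `slack` and returns
    the index of the first observation at which either cumulative sum
    exceeds `threshold`, or None if no alarm is raised.
    """
    s_hi = 0.0  # accumulated evidence of an upward shift
    s_lo = 0.0  # accumulated evidence of a downward shift
    for i, x in enumerate(values):
        s_hi = max(0.0, s_hi + (x - target - slack))
        s_lo = max(0.0, s_lo + (target - x - slack))
        if s_hi > threshold or s_lo > threshold:
            return i
    return None


# Hypothetical feed: values hover around 100, then shift up to 102.
feed = [100, 101, 99, 100, 100, 102, 102, 102, 102, 102]
alarm_at = cusum(feed, target=100, slack=0.5, threshold=4)
print(alarm_at)  # 7

# A feed that only fluctuates around the target raises no alarm.
steady = [100, 101, 99, 100, 101, 99]
print(cusum(steady, target=100, slack=0.5, threshold=4))  # None
```

The point of the design is that a small but sustained shift accumulates evidence over several observations, so it can be detected even when no single value looks extreme on its own.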

Baseline and change detection models for each cell in a multi-dimensional data cube of models can be built easily using the open source Augustus system.
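The idea of a cube of models can be sketched without any particular library. The code below is plain Python and is not the Augustus API: the Baseline class, the dimension values ("east", "west", "weekday") and the data are all hypothetical. Each cell of the cube is keyed by a tuple of dimension values and holds its own running mean and variance.

```python
from collections import defaultdict
from math import sqrt


class Baseline:
    """Running mean and variance for one cell (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        """Standardized distance of x from this cell's baseline."""
        if self.n < 2:
            return 0.0  # not enough history to judge
        sd = sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / sd if sd > 0 else 0.0


# One baseline model per cell, keyed by (region, day_type).
cube = defaultdict(Baseline)

history = [
    (("east", "weekday"), [100.0, 102.0, 98.0, 101.0, 99.0]),
    (("west", "weekday"), [10.0, 12.0, 8.0, 11.0, 9.0]),
]
for cell, values in history:
    for x in values:
        cube[cell].update(x)

# Score new observations against the baseline of their own cell.
observations = [(("east", "weekday"), 101.0), (("west", "weekday"), 16.0)]
flagged = [(cell, x) for cell, x in observations
           if abs(cube[cell].zscore(x)) > 3.0]
print(flagged)  # [(('west', 'weekday'), 16.0)]
```

In this made-up data, a value of 16 is unusual for the west cell but would not stand out in a single model pooled over both regions, because the pooled standard deviation is dominated by the difference between the regions. That is the motivation for building one model per cell.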

References

[ASQ] ASQ, The History of Quality – Overview, retrieved from http://www.asq.org.

[Bugajski] Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke and Steve Vejcik, Data Quality Models for High Volume Transaction Streams: A Case Study, Proceedings of the Second Workshop on Data Mining Case Studies and Success Stories, ACM, 2007.

[Poor] H. Vincent Poor and Olympia Hadjiliadis, Quickest Detection, Cambridge University Press, 2008.

One Response to Health and Status Monitoring

  1. Tal Galili says:

    Hi there,
    I would love to add your blog feed to:
    http://www.r-bloggers.com

    In order to do that I would need:
    1) your permission. and,
    2) A tag/category with which you use to always mark posts you write that reference “R” (I found that you don’t always tag your posts with R :( )

    Please let me know in reply to my e-mail,
    Sorry for leaving a comment instead of e-mailing you, I couldn’t find on this blog how to contact you (consider changing that :) )

    Best,
    Tal
