Health and Status Monitoring

November 29, 2009

Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon web site, or Google’s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.

As digital systems grow larger and more complex, it becomes very challenging to monitor their health and status, which is the first step in detecting potential problems, identifying their root causes, and taking appropriate preventive actions. These systems can contain thousands of different data feeds, data flows, and processes, and a problem with just one of them can interrupt payments, ads, or status updates. Often there are hourly, daily, weekly, and seasonal variations in the data that complicate the detection of problems.

One way to gain some insight into this problem is to look at the origins of statistical quality control in the 1920s. Walter Andrew Shewhart (1891–1967) was an engineer from 1918 to 1924 at the Western Electric Company, which manufactured hardware for the Bell Telephone Company. From 1925 to 1956 he was a member of the Technical Staff of Bell Telephone Laboratories [ASQ].

One of the problems that concerned him was identifying potential problems in factory assembly lines. For example, the dimensions and weight of metal parts sampled from an assembly line can be recorded. He distinguished between two types of variation in these measurements:

  • A common cause of variation (or noise) occurs as a normal part of the manufacturing process.
  • A special cause of variation is not part of the normal manufacturing process but instead represents a problem.

One of the goals of statistical quality control is to distinguish between these two types of variation and to quickly identify special causes of variation.

Shewhart introduced control charts as a tool for distinguishing between common and special causes of variation. A control chart had a central line and upper and lower control limits. When a measurement exceeded either the upper or lower control limit, it was considered a potential special cause of variation and investigated. Usually, the upper and lower control limits were three standard deviations above and below the mean.
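The three-sigma rule can be sketched in a few lines. This is a simplified version that flags individual measurements; Shewhart's original charts typically plotted statistics of small subgroups of samples, but the idea is the same.

```python
# Minimal three-sigma control chart check (a sketch, not a full
# implementation of Shewhart's subgrouped charts).

def control_limits(baseline):
    """Return (lower, upper) control limits: mean +/- 3 standard deviations
    of the baseline measurements."""
    n = len(baseline)
    mean = sum(baseline) / n
    variance = sum((x - mean) ** 2 for x in baseline) / n
    sigma = variance ** 0.5
    return mean - 3 * sigma, mean + 3 * sigma

def special_causes(baseline, measurements):
    """Indices of measurements outside the control limits -- candidate
    special causes of variation to be investigated."""
    lo, hi = control_limits(baseline)
    return [i for i, x in enumerate(measurements) if x < lo or x > hi]
```

Measurements inside the limits are treated as common-cause noise; only the points outside them trigger an investigation.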

As anyone who has investigated potential data quality problems knows, identifying the root causes of potential problems is not easy. Shewhart also introduced a four-step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle, or the Plan-Do-Check-Act Cycle:

  • Plan. Identify an opportunity or potential problem and make a plan for improving it or changing it.
  • Do. Implement the change on a small scale and collect the appropriate data.
  • Check. Use data to analyze statistically the results of the change and determine whether it made a difference.
  • Act. If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again.

These same ideas are still used today as the basis for health and status monitoring systems. Well-designed digital systems these days are built from the ground up so that appropriate log data is produced. Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data. Often this data is available through an HTTP interface and is continually collected.

This is a dashboard from the open source Augustus system for health and status monitoring.

Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor]. Instead of building a single model, a model is built for each cell in a multi-dimensional cube of models [Bugajski]. Instead of looking at the charts each day, an online dashboard is used at the hub of an operations center.

Baseline and change detection models for each cell in a multi-dimensional data cube of models can be built easily using the open source Augustus system.
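The "one model per cell" idea can be illustrated by keying baselines on a tuple of dimensions. The dimension names below (day of week, hour, data feed) are illustrative, and Augustus's actual API differs; this only shows the structure of a multi-dimensional cube of models.

```python
# Sketch of per-cell baseline models in a multi-dimensional data cube.
from collections import defaultdict

def cell(event):
    """Map an event to its cube cell; these dimensions are illustrative."""
    return (event["day_of_week"], event["hour"], event["feed"])

def build_baselines(events):
    """Group observed values by cell and summarize each cell. A real
    system would fit a statistical model (e.g. a CUSUM baseline) per
    cell instead of taking a simple mean."""
    cells = defaultdict(list)
    for e in events:
        cells[cell(e)].append(e["value"])
    return {c: sum(v) / len(v) for c, v in cells.items()}
```

Because hourly, daily, and weekly variation is captured by the cube's dimensions, each cell's model only has to explain the residual noise, which makes special causes much easier to spot.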


[ASQ] ASQ, The History of Quality – Overview, retrieved from

[Bugajski] Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke and Steve Vejcik, Data Quality Models for High Volume Transaction Streams: A Case Study, Proceedings of the Second Workshop on Data Mining Case Studies and Success Stories, ACM 2007

[Poor] H. Vincent Poor and Olympia Hadjiliadis, Quickest Detection, Cambridge University Press, 2008.

The Power of Predictive Analytics: Creating New Markets (Part 1)

November 12, 2009

Predictive models have the power to create new markets, a power that is not all that common in technology. This is the first of several posts that contain case studies describing how companies have used predictive analytics to create new markets.

Motorcycle Insurance. For many years, it was difficult to get insurance if you drove a motorcycle. Drivers of motorcycles have more accidents than other drivers, and the simplest course of action is simply not to insure them. For most drivers, knowing a few facts about them, such as their gender, age, and the number of miles driven to work, was enough information for an insurance company to set premiums for an insurance policy.

Classic black motorcycle

This segment of the automobile insurance market was called the “standard segment” and includes 80%-90% of the market. The other segment, called the “nonstandard segment”, includes drivers with accidents, drivers of motorcycles and high performance cars, older drivers, and younger drivers. From the 1950s through the 1980s, most insurance companies focused on the standard segment, and the standard market was quite competitive during this period.

Progressive Insurance took a different tack in the 1970’s. It developed an analytic model that could quantify the risk of someone who drove a motorcycle and then priced the policy accordingly. Motorcycle insurance was part of the nonstandard segment, and there was much less competition in this segment. This segment also had a higher barrier to entry, since pricing premiums (well) in this market required (simple) analytic models.

By developing an appropriate risk model, Progressive was able to create a new market, which became an important driver of its growth in the 1970s. From 1975 to 1978, premium income grew from $38 million to $112 million, as Progressive solidified its leadership in the nonstandard market.

Source: The Progressive Corporation.

Online Text Ads. Google introduced its online text ads (to the right of search results) in January 2000. The ads were sold on a cost per thousand impressions (CPM) basis by a sales representative. This was the way most ads were sold at that time, although banner ads (not text ads) were the dominant form of online advertising. These ads didn’t generate a lot of money at the beginning.

In the spring of 2000, the online banner ad market crashed. In response, Google changed its business model to a self-serve model. With a self-serve model, ads were not sold by a sales representative but instead through an online, self-serve web page. It got this idea from GoTo.com (which later became Overture, which was later bought by Yahoo!).

In October 2000, Google introduced AdWords with the slogan: “Have a credit card and 5 minutes? Get your ad on Google today.” Ads were still priced by CPM, but there was no longer a sales representative. (By the way, Amazon used the same model in August 2006 when it introduced EC2. With a credit card and 5 minutes you could use an online computer and pay by the hour.)

In 2001, Google’s AdWords revenue approached $85 million, but this was much less than Overture’s revenue of $288 million. In contrast to Google’s use of a CPM model, Overture used an auction model: the higher you bid for an ad, the more likely your ad would appear.

A problem with the auction model as employed by Overture was that high bids could force an ad to the top but no one necessarily clicked on it unless it was relevant. Relevance using information extracted from text was something Google understood well.

Google built a predictive model (a response model) to predict whether a given user would click on a given ad. Google then integrated this response model with the rankings provided by the auction. Ads with higher expected responses would be moved up higher in the rankings, and those with lower expected responses would be moved lower in the rankings. Ads with the highest rankings would then be displayed. This new model that integrated an online auction with a response model was introduced into AdWords in February 2002.
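The integration of the auction with the response model can be sketched as ranking ads by their expected value per impression, i.e. bid times predicted click-through rate. The ad names, bids, and rates below are illustrative, not Google's actual data or formula.

```python
# Sketch of ranking ads by expected value per impression:
# expected value = bid * predicted click-through rate (CTR).

def rank_ads(ads):
    """ads: list of (name, bid_dollars, predicted_ctr) tuples.
    Return the ads sorted with the highest expected revenue per
    impression first."""
    return sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
```

Under this ranking, a relevant ad with a modest bid can outrank an irrelevant ad with a high bid, which is exactly the problem with Overture's pure auction that the response model fixed.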

From a modeling perspective, Google has introduced two disruptive technologies: 1) PageRank; and 2) integrating relevance, through a response model, into pay-per-click auctions for online ads.

With this integrated model, Google created a new market (online text ads) that it has dominated since 2002.

Source: John Battelle, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture.

Health and Status Monitoring. Over the past several years, Open Data has developed predictive analytic models to monitor the operations of complex systems, such as data centers, network operations centers, world wide payment systems, etc. I’ll describe these types of models in a future post.