Health and Status Monitoring

November 29, 2009

Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon web site, or Google’s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.

As digital systems grow larger and more complex, it can become very challenging to monitor their health and status, which is the first step in detecting potential problems, identifying the root causes, and taking appropriate preventive actions. These types of systems can contain thousands of different data feeds, data flows and processes. A problem with just one of them can interrupt payments, ads, and status updates, respectively. Often there are hourly, daily, weekly and seasonal variations in the data that complicates the detection of problems.

One way to gain some insight into this problem is to look at the origins in the 1920’s of statistical quality control. Walter Andrew Shewhart (1891 – 1967) was an engineer at the Western Electric Company, which manufactured hardware for the Bell Telephone Company, from 1918-1924. From 1925 to 1956 he was a member of the Technical Staff of Bell Telephone Company [ASQ].

One of the problems that concerned him was identifying potential problems in factory assembly lines. For example, the dimensions and weight of metal parts that are sampled from an assembly can be recorded. He distinguished between two types of variations in these measurements:

  • Common cause of variation (or noise) occurs as a normal part of the manufacturing process.
  • A special cause of variation is not part of the normal manufacturing process, but represents a problem.

One of the goals of statistical quality control is to distinguish between these two types of variation and to quickly identify special causes of variation.

Shewhart introduction control charts as a tool for distinguishing between common and special causes of variation. A control chart had a central line and upper and lower control limits. When the measurement exceeded either the upper or lower control limits, it was considered a potential special cause of variation and investigated. Usually, the upper and lower control limits were three standard deviations above and and below the mean.

As anyone who has investigated potential data quality problems knows, identifying roots causes of potential problems is not easy and Shewhart also introduced a four step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle or the Plan-Do-Check-Act Cycle:

  • Plan. Identify an opportunity or potential problem and make a plan for improving it or changing it.
  • Do. Implement the change on a small scale and collect the appropriate data.
  • Check. Use data to analyze statistically the results of the change and determine whether it made a difference.
  • Act. If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again.

These same ideas are still used today as the basis for health and monitoring systems. Well designed digital systems these days are designed from the ground up so that appropriate log data is produced. Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data. Often this data is available through an http interface and is continually collected.

This is dashboard from the open source Augustus system for health and status monitoring.

Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor]. Instead of building a single model, a model for each cell in a multi-dimensional cube of models is built [Bugajski]. Instead of looking at the charts each day, an online dash board is used that is at the hub of an operations center.

Baseline and change detection models for each cell in a multi-dimensional data cube of models can be built easily using the open source Augustus system.

References

[ASQ] ASQ, The History of Quality – Overview, retrieved from http://www.asq.org.

[Bugajski] Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke and Steve Vejcik, Data Quality Models for High Volume Transaction Streams: A Case Study, Proceedings of the Second Workshop on Data Mining Case Studies and Success Stories, ACM 2007

[Poor] H. Vincent Poor and Olympia Hadjiliadi, Quickest Detection, Cambridge University Press, 2008.


[, [[, $: R accessors explained

October 21, 2009
R Accessors

R Accessors

For more than ten years, I have been teaching R both formally and informally. One thing that I find often trips up students is the use of R’s accessors and mutators. ( For those readers not from a formal computer science background, an accessor is a method for accessing data in an object usually an attribute of that object.) A simple example is taking a subset of a vector:


letters[1:3]
[1] "a" "b" "c"

As you can see, the result is a character vector containing the first three letters of letters vector.

Good programming languages have a standard pattern for accessor and mutators. For R, there are three: [, [[, and $. This confuses beginners coming from other programming languages. Java and Python have one: ‘.’. Why does R need three?

The reason derives from R’s data centric view of the world. R natively provides vectors, lists, data frames, matrices, etc. In truth, one can get by using only [ to extract information from these structures, but the others are handy in certain scenarios. So much so that after a while, they feel indispensible. I will explain each and hopefully by the end of this article you will understand why each exists, what to remember and, more importantly, when to each should be used.

Subset with [

When you want a subset of an object use [. Remember that when you take a subset of an the object you get the same type of thing. Thus, the subset of a vector will be a vector, the subset of a list will be a list and the subset of a data.frame will be a data.frame.

There is one inconsistency, however. The default in R is to reduce the results to the lowest dimension, so if your subset contains only result, you will only get that one item which may be something of a different type. Thus, taking a subset of the iris data frame with only one column


class( iris[ , "Petal.Length" ] )
[1] numeric

returns a numeric vector and not a data frame. You can override this behavior with the little publicized drop parameter, which indicates not to reduce the result. Taking the subset of iris with drop = FALSE


iris[ , "Petal.Length", drop=FALSE ]

is a proper data frame.

Things to Remember:

  • Most often, a subset is the same type as the original object.
  • Both indices and names can be used to extract the subset. ( In order to use names, object must have a name type attribute such as names, rownames, colnames, etc. )
  • You can use negative integers to indicate exclusion.
  • Unquoted variables are interpolated within the brackets.

Extract one item with [[

The double square brackets are used to extract one element from potentially many. For vectors yield vectors with a single value; data frames give a column vector; for list, one element:


letters[[3]]
iris[["Petal.Length"]]

The mnemonic device, here is that the double square bracket look as if you are asking for something deep within a container. You are not taking a slice but reaching to get at the one thing at the core.

Three important things to remember:

  • You can return only one item.
  • The result is not (necessarily) the same type of object as the container.
  • The dimension will be the dimension of the one item which is not necessarily 1.
  • And, as before:
    • Names or indices can both be used.
    • Variables are interpolated.

Interact with $

Interestingly enough, the accessor that provides the least unique utility is also probably used the most often used. $ is a special case of [[ in which you access a single item by actual name. The following are equivalent:


iris$Petal.Length
iris[["Petal.Length"]]

The appeal of this accessor is nothing more than brevity. One character, $, replaces six, [[“”]]. This accessor is handiest when doing interactive programming but should be discouraged for more production oriented code because of its limitations, namely the inability to interpolate the names or use integer indices.

Things to Remember:

  • You cannot use integer indices
  • The name will not be interpolated.
  • Returns only one item.
  • If the name contains special characters, the name must be enclosed in backticks: “

That is really all there is to it. [ – for subsets, [[ – for extracting items, and $ – for extracting by name.


R: The Dummies Package

September 30, 2009

R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnation. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day I see a new package or read a new comment on R-Help gives me pause to think.

As much as I like R, on occasion I will find myself lost in some dark corner. Sometimes, I find light. Sometimes I am gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely some else must have encountered this and provided a package. I went to the Internet and sure enough the R-wiki was here to save me. And looking even harder, I found some who had treaded before me on the R-Help archives. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:


x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)

Two problems. The first problem is that without an external source (Google), I would have never stumbled upon what I wanted. ( Thanks Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”

The second problem is the arcane syntax, wtf <- ~ xf - 1. I get it now, but it took me some time to figure out what was going on. I get it, but why not just dummy(var)? This is what I want to do.

The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t say:

model.matrix( ~ xf1 + xf2 + xf3- 1)

It turns out, you can only expand one variable at a time. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forgot about how to do it. So with a couple of spare hours, I decided that the next guy, wouldn’t have to think about it. He could just use my dummies package.

Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:


dummy( df$var )
dummy( "var", df )

Moreover, you can choose the style of the dummy names, whether to include unused factor level, to have verbose output, etc.

But more than the R-wiki solution, dummy.data.frame offers to something similar to data.frames. You can specify which columns to expand by name or class and whether to return non-expanded columns.

The package dummies-1.04 is available in CRAN. Comments and questions are always appreciated.


mapReduce Reduced (& Ported to R)

September 10, 2009

Saying MapReduce and Sector’s implementation of User Defined Functions (UDF) over a storage cloud are innovative is only partly correct. The programming models they implement are quite old. Any programmer versed in functional languages recognizes this.

But mapReduce does come with two important innovations. The first is a framework that is specifically designed for large clusters of low-priced, commodity servers. What mapReduce has done is taken the concurrent programming models and applie them to the economic realities of the day. Large and formerly expensive computations can this be accomplished cheaply when distributed to inexpensive machines. The complexity of managing individual machines and tasks is masked from the coder. The coder does not need to worry about associating or managing which tasks get run on which machine. This is invisible to the coder.

The second innovation is the recognition that a large class of practical problems (but not all) can be solved using mapReduce framework. Because the first innovation allowed solutions to problems that were intractable with conventional techniques, technologist began framing problems to run with the MapReduce. They had a hammer; everything began looking like a nail. Fortunately, there were a lot of nails.

As mentioned above, the algorithmic pattern, itself, is not new. It is actually decades old and is a throwback to the early days of functional programming (think Lisp!) and big mainframes. The method was rediscovered, applied over a distributed virtual filesystem, applied to Google’s toughest problems, renamed mapReduce and the rest is history.

The mapReduce algorithm provides a framework for dividing a problem and working on it in parallel. There are two steps: a map step and a reduce step. Although, the two steps must proceed serially — map must preceded reduce — each step can be accomplished in parallel. In the map step, data is mapped to key-value pairs. In the reduce step, the values that share the same key are transformed (‘reduce’) by some algorithm. More complexity can be added; other functions can be used; arbitrary UDF can be supported, as in Sector. But, in essence, the algorithm is as a series of function calls.

The pattern is fairly common and most programmers have used the mapReduce pattern without knowing it, thinking about it, or calling it mapReduce. In fact, much of SAS is setup in a mapReduce style. SAS programs are comprised of DATA STEPs and PROCEDURE STEPs. In certain problems, the DATA step can be a mapper and either a DATA or PROCEDURE step can function as a reducer. If you disregard, the distribution of the problem across servers, I’d venture to say that every SAS programmer has followed this paradigm, often, numerous times in the course of a single program. This simplicity allowed for the application to a wide series of problems, the second innovation.

The same can be said for our favorite statistical programming language, R. In fact, owing to the fact that R’s is a vectorized, functional language, mapReduce boils down to a single line of code:

apply( map(data), reduce )

Where, map and reduce are the user-defined functions for the mapper and reducer respectively and apply distribution the problem in parallel. Any R programmer that was taking advantage of R’s vectorization was probably writing mapReduce problems from day one. Most often, the jobs were vectorized on a individual, single core machines.

Coupled with R packages such as Rmpi, rpvm and nws, the apply-map-reduce pattern can be distributed to several machines. And even more recently, the mutlicore has allowed the easiest implementation on multicores.

We recognized this several years ago, wrote some simple code and have been distributing work across available servers for some time. More recently, we have released our work as an open source package on CRAN for implementing this pattern. Our implementation follows closely to the mapReduce Google paper, is written in pure R and is agnostic to the parallelization backend whether rpvm, rmpi, nws, multicore, or others. ( Revolution Computing recognized this as a goof idea and adopted the same approach with their ParallelR package. )

The use of the mapReduce is exceedingly simple. The package provided a single function, mapReduce. The function is defined as:

Usage
mapReduce( map, ..., data, apply = sapply)

Arguments

map An expression to be evaluated on data which yielding a vector that is subsequently
used to split the data into parts that can be operated on independently.
... The reduce step(s). One or more expressions that are evaluated for each of the partitions made
data A R data structure such as a matrix, list or data.frame.
apply The functions used for parallelization

Take the iris dataset, data(iris). Using mapReduce, we can quickly compute the mean and max petal lengths as so:


mapReduce(
map=Species,
mean.petal.length=mean(Petal.Length),
max.petal.length=max(Petal.Length) ,
data = iris
)

mean.petal.length max.petal.length
setosa 1.462 1.9
versicolor 4.260 5.1
virginica 5.552 6.9

The mean and max petal lengths are computed for each Species and returned as a matrix with two columns, mean.petal.length and max.petal.length and one row for each Species.

Because we have used expressions in our implementations, you can use almost any R function for the map and reduce step. ( Most work, there are few edge case exceptions.) For example, suppose we wanted to do the above calculation but wanted versicolor and virginica lumped together.


mapReduce(
substr(Species,1,1),
mean.petal.length=mean(Petal.Length),
max.petal.length=max(Petal.Length),
data=iris
)

mean.petal.length max.petal.length
s 1.462 1.9
v 4.906 6.9

There you have it, simple yet powerful mapReduce in R. mapReduce can be downloaded from any CRAN mirror. If you get a chance to use it, please let me know what you think.


Three Skills of Data Geeks: Redux

July 14, 2009

Michael Driscoll recently wrote a nice blog article entitled the Three Skills of the Data Geeks in the Dataspora Blog.  He lists this as studying, data munging and “story-telling”.  ( A commenter adds a fourth: decision-making. ) I was excited to see that Driscoll included “story-telling”.  I have long felt being able to relate a narrative around data and inference is an important talent.  Driscoll, however, pulls his punch.  He does not mean literal story-telling, he meant only visualization.

I can’t argue that creating striking visual representations is not an important.   But the more general skill of story-teller is also important.  Visualization is only part of the presentation that the data.  I often meet talented analysts that have nailed Driscoll’s three sexy skills.  Invariably, they have one or more advanced degrees and they have a ton of experience working data.  They can implement algorithms, parse data. spin graphs with no problem.  It is far rarer to meet someone who can tell the narrative and capture interest and excitement of the narrative whispered by data.   The one who can is the complete geek.

There are many facets to relating a good story.  You must sense the arc of drama, feel tension, contemplate historical context, promotes relevance and emote nuances.   Narrative story-telling is not mechanical, it demands creativity, patience, effort, confidence and command of the language.  It requires you to have a keen observation and know when an anecdote strengths or dilutes a conclusion.  People are wired to respond to stories.  Telling them well serves you in any field.  This one especially.

*     *     *     *     *

I sometimes find myself in the enviable position of looking for the next super-geek whether for Open Data or one of  our large clients.  When asked to give input on the process, I always stipulate two requirements.  The candidates must submit two writing samples: one technical and one not.  I can usually gauge from their resume or a quick conversation if they have the first three sexy skills.   Figuring out if the grasp the narrative is much harder.

Invariably, I read the non-technical writing sample first.  If the candidate has strong command of the language can relate technical (often boring) details in a compelling way and shows an interest in subjects outside the scope of his or her work, I know there is at least a latent potential to be that complete geek we all want on our staff or as colleagues.


One Day Course in Cloud Computing and Analytics

June 9, 2009

We’ll be offering a one day course introducing cloud computing and analytics in Chicago on June 22 and in San Mateo on July 14, 2009.

Information about the course can be found at blog.opendatagroup.com/courses

If you are currently between jobs, please send a cover letter indicating why you would like to attend the course and a resume to fee-waived-workshop@opendatagroup.com. Those selected can attend the course without charge. We have reserved five seats in the course for this purpose.


Three Common Mistakes in Analytic Projects

June 1, 2009

In this post, I’ll describe some of the most common mistakes that occur when managing analytic projects.

Mistake 1. Underestimating the time required to get the data. This is probably the most common mistake in modeling projects. Getting the data required for analytic projects usually requires a special request to the IT department. Any special requests made to IT departments can take time. Usually, several meetings are required between the business owners of the analytic problem, the statisticians building the models, and the IT department in order to decide what data is required and whether it is available. Once there is agreement on what data is required, then the special request to the IT department is made and the wait begins. Project managers are sometimes under the impression that good models can be built without data, just as statisticians are sometimes under the impression that modeling projects can be managed without a project plan.

Mistake 2. There is not a good plan for deploying the model. There are several phases in a modeling project. In one phase, data is acquired from the IT department and the model is built. A statistician is usually in charge of building the model. In the next phase, the model is deployed. This is the responsibility of the IT department. This requires providing the model with the appropriate data, post-processing the scores produced by the model to compute the associated actions, and then integrating these actions into the required business processes. Deploying models is in many cases just as complicated or more complicated than building the models and requires a plan. A good standards-compliant architecture can help here. It is often useful for the statistician to export the model as PMML. The model can then be imported by the application used in the operational system.

Mistake 3. Trying to build the perfect model. Another common mistake is trying to build the perfect statistical model. Usually, the impact of a model will be much higher if a model that is good enough is deployed and then a process is put in place that: i) reviews the effectiveness of the model frequently with the business owner of the problem; ii) refreshes the model on a regular basis with the most recent data; and, iii) rebuilds the model on a periodic basis with the lessons learned from the reviews.