R: The Dummies Package

September 30, 2009

R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnant. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day I see a new package or read a new comment on R-Help that gives me pause to think.

As much as I like R, on occasion I will find myself lost in some dark corner. Sometimes, I find light. Sometimes I am gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely someone else must have encountered this and provided a package. I went to the Internet and sure enough the R-wiki was there to save me. And looking even harder, I found some who had trodden this path before me in the R-Help archives. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:

x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)

Two problems. The first problem is that without an external source (Google), I would never have stumbled upon what I wanted. (Thanks, Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”

The second problem is the arcane syntax, wtf <- ~ xf - 1. I get it now, but it took me some time to figure out what was going on: the formula has no response, and the - 1 drops the intercept so that every level gets its own column. But why not just dummy(var)? This is what I want to do.
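To see what the - 1 is doing, it may help to compare the two design matrices side by side. This is only a sketch that reuses x and xf from above; the names with_intercept and without_intercept are illustrative.

```r
x  <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)

# Default formula: an intercept column plus one column per level
# except the first, which becomes the baseline.
with_intercept <- model.matrix(~ xf)

# Dropping the intercept with "- 1" gives one indicator column per level,
# including level 4, which never occurs in x.
without_intercept <- model.matrix(~ xf - 1)

colnames(with_intercept)
colnames(without_intercept)

# The row holding NA is dropped by the default na.action,
# so both matrices have 6 rows rather than 7.
nrow(without_intercept)
```

Note that the NA silently disappears, which is worth keeping in mind if the dummy columns are to be bound back onto the original data.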

The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t expand several variables at once by saying:

model.matrix( ~ xf1 + xf2 + xf3 - 1)

It turns out you can only fully expand one variable at a time: with the default contrasts, only the first factor keeps all of its levels, and each later factor loses one. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forget how to do it. So with a couple of spare hours, I decided that the next guy wouldn’t have to think about it. He could just use my dummies package.
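To make the limitation concrete, here is a small sketch with a made-up data frame (df, a and b are illustrative names, not taken from the wiki). It also shows one possible apply-style workaround, expanding each factor separately and binding the results; this is an assumption about how such a workaround might look, not the wiki's code.

```r
df <- data.frame(
  a = factor(c("x", "y", "x")),
  b = factor(c("p", "q", "r"))
)

# Both factors in one formula: a keeps both levels, but b loses
# its first level ("p") to the default treatment contrasts.
together <- model.matrix(~ a + b - 1, data = df)
colnames(together)

# Workaround: build a one-variable formula per column, expand each
# on its own, then bind the pieces side by side.
pieces <- lapply(names(df), function(nm) {
  model.matrix(as.formula(paste("~", nm, "- 1")), data = df)
})
separately <- do.call(cbind, pieces)
colnames(separately)
```

It works, but it is exactly the kind of incantation that is easy to forget.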

Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:

dummy( df$var )
dummy( "var", df )

Moreover, you can choose the style of the dummy names, whether to include unused factor levels, whether to have verbose output, etc.

But going beyond the R-wiki solution, dummy.data.frame does something similar for data frames. You can specify which columns to expand by name or class and whether to return non-expanded columns.
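The idea of expanding a whole data frame can be sketched in base R. The function below, expand_dummies, is a hypothetical helper written for illustration only; it is not the dummies package's code, and its argument names and column-naming scheme are assumptions.

```r
# Hypothetical helper, NOT the dummies package implementation.
# Expands every column whose class is listed in `classes` into 0/1
# indicator columns and passes all other columns through unchanged.
expand_dummies <- function(data, classes = c("factor", "character")) {
  pieces <- lapply(names(data), function(nm) {
    col <- data[[nm]]
    if (!inherits(col, classes)) {
      return(data[nm])  # keep non-expanded columns as they are
    }
    f <- factor(col)
    # One integer indicator column per level, named <column><level>.
    ind <- 1L * outer(as.character(f), levels(f), "==")
    colnames(ind) <- paste0(nm, levels(f))
    as.data.frame(ind)
  })
  do.call(cbind, pieces)
}

df <- data.frame(
  id    = 1:3,
  color = c("red", "blue", "red"),
  size  = factor(c("S", "M", "S")),
  stringsAsFactors = FALSE
)

out <- expand_dummies(df)
names(out)
```

Here id is returned untouched, while color and size are each replaced by one column per level.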

The package dummies-1.04 is available in CRAN. Comments and questions are always appreciated.

Analytic Infrastructure – Three Trends

May 11, 2009

This is a post about systems, applications, services and architectures for building and deploying analytics. Sometimes this is called analytic infrastructure. In this post, we look at several trends impacting analytic infrastructure.

Trend 1. Open source analytics has reached Main Street. R, which was first released in 1996, is now the most widely deployed open source system for statistical computing. A recent article in the New York Times estimated that over 250,000 individuals use R regularly. Dice News has created a video called “What’s Up with R” to inform job hunters using their services about R. In the language of Geoffrey A. Moore’s book Crossing the Chasm, R has reached “Main Street.”

Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used. Today, if a company does not allow the use of R, it puts the company at a competitive disadvantage.

Trend 2. The maturing of open, standards-based architectures for analytics. Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician. It is usually a challenge to deploy the model produced by the application into operational systems. Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.

The Predictive Model Markup Language (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models. PMML has become the dominant standard for statistical and data mining models. Many applications now support PMML.

By using applications that support PMML, it is possible to build an open, modular, standards-based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.

Disclaimer: I’m one of the many people who have been involved in the development of the PMML standard.

Trend 3. Cloud-based data services. Over the next several years, cloud-based services will begin to impact analytics significantly. A later post in this series will show, for example, how simple it is to use R in a cloud. Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.

Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity: if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more. This ability of clouds to handle surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense. (Third-party clouds provide computing capacity that is an operating expense, not a capital expense.)