R: The Dummies Package

September 30, 2009

R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnant. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day, a new package or a new comment on R-Help gives me pause to think.

As much as I like R, on occasion I find myself lost in some dark corner. Sometimes I find light. Sometimes I am gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely someone else must have encountered this and provided a package. I went to the Internet and, sure enough, the R-wiki was there to save me. Looking even harder, I found some in the R-Help archives who had trodden this path before me. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:


x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)

Two problems. The first is that without an external source (Google), I would never have stumbled upon what I wanted. (Thanks, Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”

The second problem is the arcane syntax, wtf <- ~ xf - 1. It took me some time to figure out what was going on: the formula has no response, and the - 1 drops the intercept so that every level gets its own column. I get it now, but why not just dummy(var)? This is what I want to do.

The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t say:

model.matrix( ~ xf1 + xf2 + xf3 - 1)

It turns out that only the first factor is fully expanded; each later factor loses a level to the contrast coding, so in practice you can only expand one variable at a time. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forget how to do it. So with a couple of spare hours, I decided that the next guy wouldn’t have to think about it. He could just use my dummies package.
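To make the "sapply's and some tests" route concrete, here is a minimal base R sketch. The helper name expand_dummies is hypothetical and is not the dummies package's code; it only illustrates expanding each categorical column separately and binding the pieces back together.

```r
# Hypothetical helper (not from any package): expand every factor or
# character column of a data frame into a full set of 0/1 indicator columns.
expand_dummies <- function(df) {
  pieces <- lapply(names(df), function(nm) {
    col <- df[[nm]]
    if (is.character(col)) col <- factor(col)
    if (!is.factor(col)) {
      # Non-categorical columns pass through unchanged
      return(setNames(data.frame(col), nm))
    }
    # Compare every value against every level: one column per level.
    # Rows with NA become NA rather than being dropped.
    ind <- outer(as.character(col), levels(col), "==") * 1L
    colnames(ind) <- paste0(nm, levels(col))
    as.data.frame(ind)
  })
  do.call(cbind, pieces)
}

df <- data.frame(
  color = c("red", "blue", "red"),
  size  = factor(c("S", "L", "S"), levels = c("S", "M", "L")),
  n     = c(1, 2, 3)
)

# The formula route gives 4 columns: size loses its first level
mm <- model.matrix(~ color + size - 1, df)

# The column-by-column route keeps all 5 indicator columns, plus n
out <- expand_dummies(df)
```

Unlike model.matrix, this sketch keeps rows that contain NA and leaves numeric columns alone; whether that is the right behaviour depends on what the dummies feed into.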

Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:


dummy( df$var )
dummy( "var", df )

Moreover, you can choose the style of the dummy names, whether to include unused factor levels, whether to have verbose output, etc.

But going beyond the R-wiki solution, dummy.data.frame does something similar for data frames. You can specify which columns to expand by name or class and whether to return non-expanded columns.

The package dummies-1.04 is available on CRAN. Comments and questions are always appreciated.


mapReduce Reduced (& Ported to R)

September 10, 2009

Saying MapReduce and Sector’s implementation of User Defined Functions (UDF) over a storage cloud are innovative is only partly correct. The programming models they implement are quite old. Any programmer versed in functional languages recognizes this.

But mapReduce does come with two important innovations. The first is a framework that is specifically designed for large clusters of low-priced, commodity servers. What mapReduce has done is take the concurrent programming models and apply them to the economic realities of the day. Large and formerly expensive computations can thus be accomplished cheaply when distributed to inexpensive machines. The complexity of managing individual machines and tasks is masked from the coder, who does not need to worry about which tasks get run on which machine.

The second innovation is the recognition that a large class of practical problems (but not all) can be solved using the mapReduce framework. Because the first innovation allowed solutions to problems that were intractable with conventional techniques, technologists began framing problems to run with mapReduce. They had a hammer; everything began looking like a nail. Fortunately, there were a lot of nails.

As mentioned above, the algorithmic pattern itself is not new. It is actually decades old and is a throwback to the early days of functional programming (think Lisp!) and big mainframes. The method was rediscovered, applied over a distributed virtual filesystem, applied to Google’s toughest problems, renamed mapReduce, and the rest is history.

The mapReduce algorithm provides a framework for dividing a problem and working on it in parallel. There are two steps: a map step and a reduce step. Although the two steps must proceed serially (map must precede reduce), each step can be accomplished in parallel. In the map step, data is mapped to key-value pairs. In the reduce step, the values that share the same key are transformed (‘reduced’) by some algorithm. More complexity can be added; other functions can be used; arbitrary UDFs can be supported, as in Sector. But, in essence, the algorithm is a series of function calls.
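As a toy illustration of the two steps, here is a word count in base R. The input chunks and the helper map_chunk are made up for this sketch; nothing here is distributed, it only shows where the key-value pairs and the per-key reduction come in.

```r
# Toy input: each element stands in for one chunk of a larger data set
chunks <- list("a b a", "b c", "a c c")

# Map step: turn each chunk into key-value pairs (word -> 1).
# Each call is independent, so lapply could be swapped for a parallel apply.
map_chunk <- function(chunk) {
  words <- strsplit(chunk, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(chunks, map_chunk))

# Group the values that share a key
grouped <- split(pairs$value, pairs$key)

# Reduce step: collapse each key's values; again independent per key
counts <- sapply(grouped, sum)
counts
```

The map calls never look at each other's chunks and the reduce calls never look at each other's keys, which is what makes both steps safe to run in parallel.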

The pattern is fairly common, and most programmers have used it without knowing it, thinking about it, or calling it mapReduce. In fact, much of SAS is set up in a mapReduce style. SAS programs are composed of DATA steps and PROCEDURE steps. In certain problems, the DATA step can be a mapper and either a DATA or PROCEDURE step can function as a reducer. If you disregard the distribution of the problem across servers, I’d venture to say that every SAS programmer has followed this paradigm, often numerous times in the course of a single program. This simplicity allowed the pattern to be applied to a wide range of problems, which is the second innovation.

The same can be said for our favorite statistical programming language, R. In fact, because R is a vectorized, functional language, mapReduce boils down to a single line of code:

apply( map(data), reduce )

Here, map and reduce are the user-defined functions for the mapper and reducer respectively, and apply distributes the problem in parallel. Any R programmer who was taking advantage of R’s vectorization was probably writing mapReduce programs from day one. Most often, the jobs were vectorized on an individual, single-core machine.

Coupled with R packages such as Rmpi, rpvm and nws, the apply-map-reduce pattern can be distributed to several machines. Even more recently, the multicore package has made for the easiest implementation on multicore machines.
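A rough sketch of what swapping the apply function looks like, assuming a current R installation, where the parallel package (which took over mclapply from multicore) ships with base R. The choice of two cores is arbitrary and only for illustration.

```r
library(parallel)

# Map: partition the petal lengths by species
partitions <- split(iris$Petal.Length, iris$Species)

# Reduce, serial version: one mean per partition
serial <- sapply(partitions, mean)

# Same pattern with a forking backend. mclapply() cannot fork on Windows,
# so fall back to a single core there.
cores  <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- unlist(mclapply(partitions, mean, mc.cores = cores))

# Both routes should agree
all.equal(serial, forked)
```

Only the apply function changes between the two versions; the map and reduce pieces stay the same, which is the point of keeping the pattern backend-agnostic.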

We recognized this several years ago, wrote some simple code and have been distributing work across available servers for some time. More recently, we have released our work as an open source package on CRAN for implementing this pattern. Our implementation closely follows Google’s mapReduce paper, is written in pure R, and is agnostic to the parallelization backend, whether rpvm, Rmpi, nws, multicore, or others. (Revolution Computing recognized this as a good idea and adopted the same approach with their ParallelR package.)

Using mapReduce is exceedingly simple. The package provides a single function, mapReduce. The function is defined as:

Usage

mapReduce( map, ..., data, apply = sapply )

Arguments

map     An expression to be evaluated on data that yields a vector, which is subsequently used to split the data into parts that can be operated on independently.
...     The reduce step(s). One or more expressions that are evaluated for each of the partitions made by map.
data    An R data structure such as a matrix, list or data.frame.
apply   The function used for parallelization.

Take the iris dataset, data(iris). Using mapReduce, we can quickly compute the mean and max petal lengths like so:


mapReduce(
  map = Species,
  mean.petal.length = mean(Petal.Length),
  max.petal.length = max(Petal.Length),
  data = iris
)

           mean.petal.length max.petal.length
setosa                 1.462              1.9
versicolor             4.260              5.1
virginica              5.552              6.9

The mean and max petal lengths are computed for each Species and returned as a matrix with two columns, mean.petal.length and max.petal.length and one row for each Species.
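For readers curious how a function can accept bare expressions like these, here is a minimal sketch. The name map_reduce_sketch and its internals are assumptions made for illustration and are not the package's source; it also defaults to lapply rather than sapply so that the results can simply be row-bound.

```r
# Hypothetical re-implementation for illustration only
map_reduce_sketch <- function(map, ..., data, apply = lapply) {
  env <- parent.frame()

  # Capture the map and reduce expressions without evaluating them
  map_expr     <- substitute(map)
  reduce_exprs <- eval(substitute(alist(...)))

  # Map step: evaluate the map expression inside data, one key per row
  keys <- eval(map_expr, data, env)

  # Split the rows into one partition per key
  parts <- split(data, keys)

  # Reduce step: evaluate every reduce expression within each partition.
  # `apply` must return a list (lapply, or a parallel drop-in).
  results <- apply(parts, function(part) {
    sapply(reduce_exprs, function(expr) eval(expr, part, env))
  })

  # One row per key, one column per reduce expression
  do.call(rbind, results)
}

res <- map_reduce_sketch(
  map = Species,
  mean.petal.length = mean(Petal.Length),
  max.petal.length  = max(Petal.Length),
  data = iris
)
res
```

The key trick is substitute(), which grabs the expressions before R tries to evaluate them, so names like Petal.Length are only looked up later inside each partition.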

Because we have used expressions in our implementation, you can use almost any R function for the map and reduce steps. (Most work; there are a few edge-case exceptions.) For example, suppose we wanted to do the above calculation but wanted versicolor and virginica lumped together.


mapReduce(
  substr(Species, 1, 1),
  mean.petal.length = mean(Petal.Length),
  max.petal.length = max(Petal.Length),
  data = iris
)

  mean.petal.length max.petal.length
s             1.462              1.9
v             4.906              6.9

There you have it: simple yet powerful mapReduce in R. mapReduce can be downloaded from any CRAN mirror. If you get a chance to use it, please let me know what you think.