[, [[, $: R accessors explained

October 21, 2009
R Accessors

R Accessors

For more than ten years, I have been teaching R both formally and informally. One thing that I find often trips up students is the use of R’s accessors and mutators. ( For those readers not from a formal computer science background, an accessor is a method for accessing data in an object usually an attribute of that object.) A simple example is taking a subset of a vector:


letters[1:3]
[1] "a" "b" "c"

As you can see, the result is a character vector containing the first three letters of letters vector.

Good programming languages have a standard pattern for accessor and mutators. For R, there are three: [, [[, and $. This confuses beginners coming from other programming languages. Java and Python have one: ‘.’. Why does R need three?

The reason derives from R’s data centric view of the world. R natively provides vectors, lists, data frames, matrices, etc. In truth, one can get by using only [ to extract information from these structures, but the others are handy in certain scenarios. So much so that after a while, they feel indispensible. I will explain each and hopefully by the end of this article you will understand why each exists, what to remember and, more importantly, when to each should be used.

Subset with [

When you want a subset of an object use [. Remember that when you take a subset of an the object you get the same type of thing. Thus, the subset of a vector will be a vector, the subset of a list will be a list and the subset of a data.frame will be a data.frame.

There is one inconsistency, however. The default in R is to reduce the results to the lowest dimension, so if your subset contains only result, you will only get that one item which may be something of a different type. Thus, taking a subset of the iris data frame with only one column


class( iris[ , "Petal.Length" ] )
[1] numeric

returns a numeric vector and not a data frame. You can override this behavior with the little publicized drop parameter, which indicates not to reduce the result. Taking the subset of iris with drop = FALSE


iris[ , "Petal.Length", drop=FALSE ]

is a proper data frame.

Things to Remember:

  • Most often, a subset is the same type as the original object.
  • Both indices and names can be used to extract the subset. ( In order to use names, object must have a name type attribute such as names, rownames, colnames, etc. )
  • You can use negative integers to indicate exclusion.
  • Unquoted variables are interpolated within the brackets.

Extract one item with [[

The double square brackets are used to extract one element from potentially many. For vectors yield vectors with a single value; data frames give a column vector; for list, one element:


letters[[3]]
iris[["Petal.Length"]]

The mnemonic device, here is that the double square bracket look as if you are asking for something deep within a container. You are not taking a slice but reaching to get at the one thing at the core.

Three important things to remember:

  • You can return only one item.
  • The result is not (necessarily) the same type of object as the container.
  • The dimension will be the dimension of the one item which is not necessarily 1.
  • And, as before:
    • Names or indices can both be used.
    • Variables are interpolated.

Interact with $

Interestingly enough, the accessor that provides the least unique utility is also probably used the most often used. $ is a special case of [[ in which you access a single item by actual name. The following are equivalent:


iris$Petal.Length
iris[["Petal.Length"]]

The appeal of this accessor is nothing more than brevity. One character, $, replaces six, [[“”]]. This accessor is handiest when doing interactive programming but should be discouraged for more production oriented code because of its limitations, namely the inability to interpolate the names or use integer indices.

Things to Remember:

  • You cannot use integer indices
  • The name will not be interpolated.
  • Returns only one item.
  • If the name contains special characters, the name must be enclosed in backticks: “

That is really all there is to it. [ – for subsets, [[ – for extracting items, and $ – for extracting by name.


R: The Dummies Package

September 30, 2009

R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnation. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day I see a new package or read a new comment on R-Help gives me pause to think.

As much as I like R, on occasion I will find myself lost in some dark corner. Sometimes, I find light. Sometimes I am gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely some else must have encountered this and provided a package. I went to the Internet and sure enough the R-wiki was here to save me. And looking even harder, I found some who had treaded before me on the R-Help archives. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:


x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)

Two problems. The first problem is that without an external source (Google), I would have never stumbled upon what I wanted. ( Thanks Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”

The second problem is the arcane syntax, wtf <- ~ xf - 1. I get it now, but it took me some time to figure out what was going on. I get it, but why not just dummy(var)? This is what I want to do.

The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t say:

model.matrix( ~ xf1 + xf2 + xf3- 1)

It turns out, you can only expand one variable at a time. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forgot about how to do it. So with a couple of spare hours, I decided that the next guy, wouldn’t have to think about it. He could just use my dummies package.

Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:


dummy( df$var )
dummy( "var", df )

Moreover, you can choose the style of the dummy names, whether to include unused factor level, to have verbose output, etc.

But more than the R-wiki solution, dummy.data.frame offers to something similar to data.frames. You can specify which columns to expand by name or class and whether to return non-expanded columns.

The package dummies-1.04 is available in CRAN. Comments and questions are always appreciated.


mapReduce Reduced (& Ported to R)

September 10, 2009

Saying MapReduce and Sector’s implementation of User Defined Functions (UDF) over a storage cloud are innovative is only partly correct. The programming models they implement are quite old. Any programmer versed in functional languages recognizes this.

But mapReduce does come with two important innovations. The first is a framework that is specifically designed for large clusters of low-priced, commodity servers. What mapReduce has done is taken the concurrent programming models and applie them to the economic realities of the day. Large and formerly expensive computations can this be accomplished cheaply when distributed to inexpensive machines. The complexity of managing individual machines and tasks is masked from the coder. The coder does not need to worry about associating or managing which tasks get run on which machine. This is invisible to the coder.

The second innovation is the recognition that a large class of practical problems (but not all) can be solved using mapReduce framework. Because the first innovation allowed solutions to problems that were intractable with conventional techniques, technologist began framing problems to run with the MapReduce. They had a hammer; everything began looking like a nail. Fortunately, there were a lot of nails.

As mentioned above, the algorithmic pattern, itself, is not new. It is actually decades old and is a throwback to the early days of functional programming (think Lisp!) and big mainframes. The method was rediscovered, applied over a distributed virtual filesystem, applied to Google’s toughest problems, renamed mapReduce and the rest is history.

The mapReduce algorithm provides a framework for dividing a problem and working on it in parallel. There are two steps: a map step and a reduce step. Although, the two steps must proceed serially — map must preceded reduce — each step can be accomplished in parallel. In the map step, data is mapped to key-value pairs. In the reduce step, the values that share the same key are transformed (‘reduce’) by some algorithm. More complexity can be added; other functions can be used; arbitrary UDF can be supported, as in Sector. But, in essence, the algorithm is as a series of function calls.

The pattern is fairly common and most programmers have used the mapReduce pattern without knowing it, thinking about it, or calling it mapReduce. In fact, much of SAS is setup in a mapReduce style. SAS programs are comprised of DATA STEPs and PROCEDURE STEPs. In certain problems, the DATA step can be a mapper and either a DATA or PROCEDURE step can function as a reducer. If you disregard, the distribution of the problem across servers, I’d venture to say that every SAS programmer has followed this paradigm, often, numerous times in the course of a single program. This simplicity allowed for the application to a wide series of problems, the second innovation.

The same can be said for our favorite statistical programming language, R. In fact, owing to the fact that R’s is a vectorized, functional language, mapReduce boils down to a single line of code:

apply( map(data), reduce )

Where, map and reduce are the user-defined functions for the mapper and reducer respectively and apply distribution the problem in parallel. Any R programmer that was taking advantage of R’s vectorization was probably writing mapReduce problems from day one. Most often, the jobs were vectorized on a individual, single core machines.

Coupled with R packages such as Rmpi, rpvm and nws, the apply-map-reduce pattern can be distributed to several machines. And even more recently, the mutlicore has allowed the easiest implementation on multicores.

We recognized this several years ago, wrote some simple code and have been distributing work across available servers for some time. More recently, we have released our work as an open source package on CRAN for implementing this pattern. Our implementation follows closely to the mapReduce Google paper, is written in pure R and is agnostic to the parallelization backend whether rpvm, rmpi, nws, multicore, or others. ( Revolution Computing recognized this as a goof idea and adopted the same approach with their ParallelR package. )

The use of the mapReduce is exceedingly simple. The package provided a single function, mapReduce. The function is defined as:

Usage
mapReduce( map, ..., data, apply = sapply)

Arguments

map An expression to be evaluated on data which yielding a vector that is subsequently
used to split the data into parts that can be operated on independently.
... The reduce step(s). One or more expressions that are evaluated for each of the partitions made
data A R data structure such as a matrix, list or data.frame.
apply The functions used for parallelization

Take the iris dataset, data(iris). Using mapReduce, we can quickly compute the mean and max petal lengths as so:


mapReduce(
map=Species,
mean.petal.length=mean(Petal.Length),
max.petal.length=max(Petal.Length) ,
data = iris
)

mean.petal.length max.petal.length
setosa 1.462 1.9
versicolor 4.260 5.1
virginica 5.552 6.9

The mean and max petal lengths are computed for each Species and returned as a matrix with two columns, mean.petal.length and max.petal.length and one row for each Species.

Because we have used expressions in our implementations, you can use almost any R function for the map and reduce step. ( Most work, there are few edge case exceptions.) For example, suppose we wanted to do the above calculation but wanted versicolor and virginica lumped together.


mapReduce(
substr(Species,1,1),
mean.petal.length=mean(Petal.Length),
max.petal.length=max(Petal.Length),
data=iris
)

mean.petal.length max.petal.length
s 1.462 1.9
v 4.906 6.9

There you have it, simple yet powerful mapReduce in R. mapReduce can be downloaded from any CRAN mirror. If you get a chance to use it, please let me know what you think.


The hash package: hashes come to R

July 26, 2009

Perl has hashes. Python has dictionaries. Why doesn’t R have an equivalent? Hash tables and associative arrays are indispensable tools for the programmer. One of the most common and basic tasks of a programmer is to “look up” or “map” a key to a value. In fact, there are projects whose sole raison d’ĂȘtre is making the hash as fast and as efficient as possible.

R actually has two equivalents, both lacking. The first is R’s named vectors and lists. Elements of vectors and lists can be accessed by name, through the standard R methods:


obj$name
obj['name']
obj[['name']]

Vectors are not stored using internal hash tables and as they grow large, performance can suffer. The performance impact is tangible even on small lists. For programs doing many look-ups or look-ups on many objects, this can create a bottleneck.

R’s environments are much closer to Perl hashes and Python’s dictionary. The structure of the environment is a hash table internally and look-ups do not appreciably degrade with object size. To use a R environment, you need to create it and assign key-value pairs to it.


hash = new.env(hash=TRUE, parent=emptyenv(), size=100L)
assign(key, value, hash)
get(key, hash)

We can even get the keys from the hash with the ls function:


ls( env=hash )

This works well and perfomance is good. So what’s the problem?

Usability. In designing, the S language, John Chambers put much thought into how the analyst and statistician interact with data. All varaibles are designed to be vectors and a standard set of accessors( $, [, [[ ) were defined to retrieve and set slices, subsets or elements of the data. The problem is that R environments don’t follow this pattern. And this is where the hash package comes in.

The hash package is designed to provide an R-syntax to R’s environments and give programmers a hash. The package provides one constructor function, hash that will take a variety of arguments, always doing the right thing. All of the following work:


hash()
hash( keys=c('foo','bar','baz'), values=1:3 )
hash( foo=1, bar=2, baz=3 )
hash( c( foo=1, bar=2, baz=3 ) )
hash( list( foo=1, bar=2, baz=3 ) )
hash( c('foo','bar','baz'), 1:3 )

It pretty much does what you mean.

The standard accessors: [, [[, $ are also available.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ]
h[[ 'foo' ]]
h$foo

As does their corresponding replacement methods.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ] <- c( 'fred', 'wilma' )
h[[ 'foo' ]] <- 'dino'
h$foo <- 'bam bam'

There you have it, hashes for R.

I (CB) am the maintainer of the package, so if you have any suggestions for the package, please let me know.