The hash package: hashes come to R

July 26, 2009

Perl has hashes. Python has dictionaries. Why doesn’t R have an equivalent? Hash tables and associative arrays are indispensable tools for the programmer. One of the most common and basic tasks of a programmer is to “look up” or “map” a key to a value. In fact, there are projects whose sole raison d’être is making the hash as fast and as efficient as possible.

R actually has two equivalents, both lacking. The first is R’s named vectors and lists. Elements of vectors and lists can be accessed by name, through the standard R methods:


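obj$name        # lists only; `$` is not valid on atomic vectors
obj['name']     # returns a sub-vector or sub-list
obj[['name']]   # returns the single element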

Vectors and lists are not stored using internal hash tables: a look-up by name scans the names one by one, so performance can suffer as they grow large. The performance impact is tangible even on small lists. For programs doing many look-ups, or look-ups on many objects, this can create a bottleneck.

R’s environments are much closer to Perl hashes and Python dictionaries. Internally, an environment is a hash table, and look-ups do not appreciably degrade with object size. To use an R environment, you need to create it and assign key-value pairs to it.


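key <- 'foo'; value <- 1            # an example key-value pair
hash <- new.env(hash=TRUE, parent=emptyenv(), size=100L)
assign(key, value, envir=hash)      # store value under key
get(key, envir=hash)                # look the value up again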

We can even get the keys from the hash with the ls function:


ls( envir=hash )
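
To make the comparison between the two approaches concrete, here is a rough timing sketch in base R. The names n, keys, lst, env and probe are illustrative only, and the timings will vary with machine and R version, so treat the output as indicative rather than as a benchmark.

```r
# Sketch: time name look-ups on a named list and on a hashed environment.
n    <- 20000L
keys <- paste0("key", seq_len(n))

lst <- setNames(as.list(seq_len(n)), keys)            # named list
env <- list2env(lst, hash = TRUE, parent = emptyenv()) # same data in an environment

set.seed(1)
probe <- sample(keys, 2000L)                          # keys to look up

list_time <- system.time(for (k in probe) lst[[k]])[["elapsed"]]
env_time  <- system.time(for (k in probe) get(k, envir = env))[["elapsed"]]

print(c(list = list_time, env = env_time))
```

Both structures return the same values; only the cost of finding them differs.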

This works well and performance is good. So what’s the problem?

Usability. In designing the S language, John Chambers put much thought into how the analyst and statistician interact with data. All variables are designed to be vectors, and a standard set of accessors ( $, [, [[ ) was defined to retrieve and set slices, subsets or elements of the data. The problem is that R environments don’t fully follow this pattern: $ and [[ work on them, but [ does not, so several keys cannot be read or set with the usual subsetting syntax. And this is where the hash package comes in.
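
A minimal base-R sketch of where the gap shows. The environment e and the keys in it are illustrative names, not anything from the hash package.

```r
# Base R only.
e <- new.env(hash = TRUE, parent = emptyenv())
assign("foo", 1, envir = e)
assign("bar", 2, envir = e)

e$foo        # single-key access with `$` works on environments
e[["bar"]]   # so does `[[`

# `[` is not defined for environments, so slicing several keys fails;
# tryCatch() captures the error message instead of stopping the script
slice <- tryCatch(e[c("foo", "bar")],
                  error = function(err) conditionMessage(err))
print(slice)

# Fetching several keys at once needs mget() rather than `[`
vals <- mget(c("foo", "bar"), envir = e)
```

The data is all there, but the vectorised look-up has to go through a function with its own calling convention rather than through the accessor an R user would reach for first.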

The hash package is designed to provide R syntax for R’s environments and give programmers a hash. The package provides one constructor function, hash, which takes a variety of arguments and always does the right thing. All of the following work:


hash()
hash( keys=c('foo','bar','baz'), values=1:3 )
hash( foo=1, bar=2, baz=3 )
hash( c( foo=1, bar=2, baz=3 ) )
hash( list( foo=1, bar=2, baz=3 ) )
hash( c('foo','bar','baz'), 1:3 )

It pretty much does what you mean.

The standard accessors, [, [[ and $, are also available.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ]
h[[ 'foo' ]]
h$foo

So are their corresponding replacement methods.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ] <- c( 'fred', 'wilma' )
h[[ 'foo' ]] <- 'dino'
h$foo <- 'bam bam'
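
For the curious, here is a toy sketch of how vectorised accessors can be layered on top of an environment. It uses S3 for brevity and is not the hash package's implementation; toy_hash and its methods are hypothetical names made up for this illustration.

```r
# Toy illustration only: NOT how the hash package is implemented.
toy_hash <- function(keys = character(0), values = NULL) {
  env <- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_along(keys)) assign(keys[i], values[[i]], envir = env)
  structure(list(env = env), class = "toy_hash")
}

# Vectorised read: h[c("foo", "bar")] returns a named list
`[.toy_hash` <- function(x, i) mget(i, envir = x$env)

# Vectorised write: h[c("foo", "bar")] <- c("fred", "wilma")
`[<-.toy_hash` <- function(x, i, value) {
  value <- rep(as.list(value), length.out = length(i))  # recycle values
  for (k in seq_along(i)) assign(i[k], value[[k]], envir = x$env)
  x
}

h <- toy_hash(c("foo", "bar", "baz"), 1:3)
h[c("foo", "bar")] <- c("fred", "wilma")
result <- h[c("foo", "baz")]
print(result)
```

The wrapper holds nothing but the environment, so every read and write still goes through the hash table; the methods only supply the familiar syntax.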

There you have it, hashes for R.

I (CB) am the maintainer of the package, so if you have any suggestions for the package, please let me know.

Three Skills of Data Geeks: Redux

July 14, 2009

Michael Driscoll recently wrote a nice blog article entitled the Three Skills of the Data Geeks on the Dataspora Blog. He lists these as studying, data munging and “story-telling”. (A commenter adds a fourth: decision-making.) I was excited to see that Driscoll included “story-telling”. I have long felt that being able to relate a narrative around data and inference is an important talent. Driscoll, however, pulls his punch. He does not mean literal story-telling; he means only visualization.

I can’t argue that creating striking visual representations is not important. But the more general skill of story-telling is also important. Visualization is only part of the presentation of the data. I often meet talented analysts who have nailed Driscoll’s three sexy skills. Invariably, they have one or more advanced degrees and a ton of experience working with data. They can implement algorithms, parse data and spin graphs with no problem. It is far rarer to meet someone who can tell the story whispered by the data and capture its interest and excitement. The one who can is the complete geek.

There are many facets to relating a good story. You must sense the arc of drama, feel tension, contemplate historical context, promote relevance and emote nuance. Narrative story-telling is not mechanical; it demands creativity, patience, effort, confidence and command of the language. It requires keen observation and a sense of when an anecdote strengthens or dilutes a conclusion. People are wired to respond to stories. Telling them well serves you in any field. This one especially.

*     *     *     *     *

I sometimes find myself in the enviable position of looking for the next super-geek, whether for Open Data or one of our large clients. When asked to give input on the process, I always stipulate two requirements. The candidates must submit two writing samples: one technical and one not. I can usually gauge from their resume or a quick conversation whether they have the first three sexy skills. Figuring out whether they grasp the narrative is much harder.

Invariably, I read the non-technical writing sample first. If the candidate has a strong command of the language, can relate technical (often boring) details in a compelling way, and shows an interest in subjects outside the scope of his or her work, I know there is at least a latent potential to be that complete geek we all want on our staff or as colleagues.