The hash package: hashes come to R

Perl has hashes. Python has dictionaries. Why doesn’t R have an equivalent? Hash tables and associative arrays are indispensable tools for the programmer. One of the most common and basic tasks of a programmer is to “look up” or “map” a key to a value. In fact, there are projects whose sole raison d’être is making the hash as fast and as efficient as possible.

R actually has two equivalents, both lacking. The first is R’s named vectors and lists. Elements of vectors and lists can be accessed by name, through the standard R methods:


obj$name
obj['name']
obj[['name']]

Neither vectors nor lists are backed by internal hash tables, so look-ups by name slow down as the objects grow. The performance impact is tangible even on small lists. For programs doing many look-ups, or look-ups on many objects, this can create a bottleneck.
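As a small illustration of how the three accessors differ on a named list, here is a sketch. The object `obj` and its contents are made up for this example.

```r
# Hypothetical named list, used only for illustration.
obj <- list(name = "fred", age = 42)

by_dollar <- obj$name        # the element itself
by_double <- obj[["name"]]   # the same element, key given as a string
by_single <- obj["name"]     # a sub-list of length 1 that keeps its name

# On a list, looking up a missing key with [[ returns NULL, not an error.
missing_is_null <- is.null(obj[["no_such_key"]])

print(by_single)
```

Note that `[` keeps the container while `[[` and `$` extract the element, which matters when the result is passed on to other code.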

R’s environments are much closer to Perl hashes and Python dictionaries. An environment is a hash table internally, and look-ups do not appreciably degrade with object size. To use an R environment, you need to create it and assign key-value pairs to it.


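key   <- 'foo'    # example key; any string works
value <- 1        # example value
hash  <- new.env( hash=TRUE, parent=emptyenv(), size=100L )
assign( key, value, envir=hash )   # store the key-value pair
get( key, envir=hash )             # retrieve the value by key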

We can even get the keys from the hash with the ls function:


ls( envir=hash )
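To get a feel for the difference, the following rough sketch times the same look-ups against a named list and an environment. The sizes, key names and variable names are arbitrary choices for this example, and the timings will vary by machine and R version, so treat it as an experiment rather than a benchmark.

```r
# Build n key-value pairs; the key names are arbitrary.
n    <- 10000L
keys <- paste0("key", seq_len(n))

# Named list: values stored alongside a names attribute.
lst <- as.list(seq_len(n))
names(lst) <- keys

# Environment: values stored in an internal hash table.
env <- new.env(hash = TRUE, parent = emptyenv(), size = n)
for (i in seq_len(n)) assign(keys[i], i, envir = env)

# Look up the same random sample of keys in both structures.
set.seed(1)
probe  <- sample(keys, 2000L)
t_list <- system.time(from_list <- lapply(probe, function(k) lst[[k]]))
t_env  <- system.time(from_env  <- lapply(probe, function(k) get(k, envir = env)))

# Both structures return the same values; only the elapsed time differs.
same_values <- identical(from_list, from_env)
print(c(list = t_list[["elapsed"]], env = t_env[["elapsed"]]))
```

Increasing `n` should make any gap between the two easier to see.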

This works well and performance is good. So what’s the problem?

Usability. In designing the S language, John Chambers put much thought into how the analyst and statistician interact with data. All variables are designed to be vectors, and a standard set of accessors ( $, [, [[ ) was defined to retrieve and set slices, subsets or elements of the data. The problem is that R environments don’t follow this pattern. And this is where the hash package comes in.
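A short sketch of where plain environments break the pattern, using only base R. The environment `e` and its keys are made up for this example. In current versions of R, `$` and `[[` do work on an environment, but `[` does not, so fetching several keys at once needs a different function.

```r
e <- new.env(hash = TRUE, parent = emptyenv())
assign("foo", 1, envir = e)
assign("bar", 2, envir = e)

# Single-key access works with the usual operators.
one <- e$foo
two <- e[["bar"]]

# Subsetting with `[` raises an error; capture its message instead of stopping.
res <- tryCatch(e[c("foo", "bar")], error = function(err) conditionMessage(err))
print(res)

# Fetching several keys at once has to go through mget() instead.
vals <- mget(c("foo", "bar"), envir = e)
```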

The hash package is designed to provide R syntax for R’s environments and give programmers a hash. The package provides one constructor function, hash, that takes a variety of arguments and always does the right thing. All of the following work:


hash()
hash( keys=c('foo','bar','baz'), values=1:3 )
hash( foo=1, bar=2, baz=3 )
hash( c( foo=1, bar=2, baz=3 ) )
hash( list( foo=1, bar=2, baz=3 ) )
hash( c('foo','bar','baz'), 1:3 )

It pretty much does what you mean.

The standard accessors ( [, [[, $ ) are also available.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ]
h[[ 'foo' ]]
h$foo

As are their corresponding replacement methods.


h <- hash( c('foo','bar','baz'), 1:3 )
h[ c('foo','bar') ] <- c( 'fred', 'wilma' )
h[[ 'foo' ]] <- 'dino'
h$foo <- 'bam bam'
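To show the general idea of putting vector-style accessors on top of an environment, here is a minimal base-R sketch. It is not the hash package’s implementation: the class name `envhash` and its constructor are hypothetical, and S3 is used here only to keep the example short.

```r
# Hypothetical constructor: an environment tagged with an S3 class.
envhash <- function() {
  e <- new.env(hash = TRUE, parent = emptyenv())
  class(e) <- "envhash"
  e
}

# `[` returns a named list holding the requested keys.
`[.envhash` <- function(x, i) mget(i, envir = x)

# `[<-` assigns values to keys pairwise; the environment is modified in place.
`[<-.envhash` <- function(x, i, value) {
  stopifnot(length(i) == length(value))
  for (k in seq_along(i)) assign(i[k], value[[k]], envir = x)
  x
}

h <- envhash()
h[c("foo", "bar")] <- c("fred", "wilma")
h[["foo"]] <- "dino"      # [[<- already works on environments
h$bar <- "bam bam"        # so does $<-
print(h[c("foo", "bar")])
print(sort(ls(envir = h)))
```

Only `[` and `[<-` need methods in this sketch, because R already handles `$` and `[[` on environments.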

There you have it, hashes for R.

I (CB) am the maintainer of the package, so if you have any suggestions for the package, please let me know.


13 Responses to The hash package: hashes come to R

  1. John Gant has pointed out that, at present, hash cannot contain hashes.

    John writes:

    I found a super simple example showing how it’s not working for me. That eliminates all of my code, which should make finding the issue easier.

    > library( hash )
    > h1 = hash()
    > h2 = hash()
    > h2[[ 101 ]] = 102
    > h1[[ 100 ]] = h2
    Error in FUN("1"[[1L]], ...) : object ‘1’ not found

    * * * * *

    My Response:

    The problem is that hash does not support hashes as values. The offending line is:

    h1[[ 100 ]] <- h2

    In this case the assignment tries to interpret h2 rather than, say, creating a reference. There are some other problems with embedding hashes.

    h1 h1[["100"]]
    [[1]]
    An object of type ‘hash’ containing 1 key-value pairs.
    101 : 102
    ## LOOKS GOOD ##
    > h1[["100"]][["101"]]
    NULL
    ## UH-OH ##

    Much of the problem stems from how R handles environments. It is going to take a bit of thought and some effort this weekend. Until then I would consider hash in hash as unsupported. But I understand that it is a very important use case for both me and you.

    * * * * *

    I am presently testing a fix. If all works well the package will be posted to CRAN soon.

  2. The fix is in. Hash 1.0 is hitting CRAN now. This is considered the first production-ready version. Several changes were made to make it more R-ish and speedier.

    Feedback is appreciated.

  3. From the R 2.9.0 release notes:

    o A mechanism now allows S4 classes to inherit from object types “environment”, “externalptr” and symbol (“name”). See ?setClass.

    The hash package will be changed to use this, concomitant with R-2.10.0.

  4. Version 1.0.2 was uploaded to CRAN this evening. See the ChangeLog for details. This is a small-ish bug-fix release only.

  5. This version does however use the .Data slot to contain the underlying environment. The API did not change at all.

    C-

  6. Denise Mauldin says:

    Hi there,

    It appears that the fix does not allow for hashes nested more than one level deep? I’m trying to do a hash of a hash of a hash and getting an error:

    Error in get(make.keys(i), x@.Data) : object ‘1’ not found

    i.e.:
    key <- 'one'
    ikey <- 'two'
    val <- 'three'
    info <- hash()
    info[key] <- hash( keys=c(ikey), values=c(val) )

    • Denise,

      Sorry for the late reply and thanks for the bug report. Indeed this was not behaving as expected. Moments ago, I released hash version 1.10.0 to address this and other minor annoyances. It should be trickling its way through the CRAN mirrors. Here is the result of your test case:

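      > key <- 'one'
      > ikey <- 'two'
      > val <- 'three'
      > info <- hash()
      > info[key] <- hash( keys=c(ikey), values=c(val) )
      > info
      <hash> containing 1 key-value pairs.
      one : <hash> containing 1 key-value pairs.
      two : three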

      I think that this is what you expected.

      Thanks Again,

      Chris

  7. maximilian haussler says:

    I want to mention that often you don’t need hashes in R if you use data frames and join them with the merge() function.

    • Maximilian,

      Yes. You can emulate hashes by using data.frames and the merge() function. You can also use named lists. And if all items of the hash are of the same base class, you can even use named vectors.

      The problem with each of these emulations is that they do not scale well: look-ups are O(n). They are fine when n is small but are disastrous when it grows. The hash package solves this by using R’s environments, i.e. real hash tables. In fact, the hash package only provides a more intuitive interface.

      Try it. Compare data frames joined with merge() against hash on a million records.

      Chris

  8. Alex says:

    Dang, it’s broken in R version 2.11 (both on my install, and on CRAN).

    The failure to install (either from CRAN or from source) gives the message:

    Defining type “environment” as a superclass via class “.environment”
    Error in matchSignature(signature, fdef) :
    more elements in the method signature (2) than in the generic signature (1)
    Error : unable to load R code in package ‘hash’

    So it’s probably something simple, but I can’t figure it out. Commenting out all the “generic” signatures didn’t help, either. Bummer!

    I hope this can get fixed, as “hash” is literally the best single R package there is.

    • Christopher says:

      Alex,

      It took me a month to find the time to fix it, but I uploaded version 1.99.3 of the hash package to CRAN earlier today. It was tested on the development branch of R (version 2.11), so it should work great for you. If you have any problems, please let me know.

      Chris

  9. Daniel Keller says:

    Hi Chris,

    I’ve found this package quite useful, thank you for creating it. However, I’ve noticed that with a moderately large hash (~5k items) inserting can become very slow due to the check to see if the item is present, which fetches the entire list of keys and searches it. It appears to be a trivial change to work around this by using a tryCatch block to catch the error generated by R if the key is missing:

    # Inside the accessor method, where `key` and `x` are already defined:
    onError <- function(e) {
      cat("key:", key, "not found in hash:", substitute(x), "\n")
      NULL
    }
    return(tryCatch(get(key, x@.Data), error=onError))

    Cheers,
    Daniel
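For readers who want to try the idea outside the package, here is a base-R sketch of the two ways of checking for a key on a plain environment. The helper names (`has_key_scan`, `has_key_hashed`, `get_or_null`) are hypothetical and not part of the hash package.

```r
e <- new.env(hash = TRUE, parent = emptyenv())
assign("foo", 1, envir = e)

# Slow pattern: materialise every key, then search the resulting vector.
has_key_scan <- function(key, env) key %in% ls(envir = env, all.names = TRUE)

# Direct check: ask the environment itself, without looking in parent frames.
has_key_hashed <- function(key, env) exists(key, envir = env, inherits = FALSE)

# tryCatch variant along the lines suggested above: NULL for a missing key.
get_or_null <- function(key, env) {
  tryCatch(get(key, envir = env, inherits = FALSE), error = function(err) NULL)
}

print(has_key_hashed("foo", e))
print(is.null(get_or_null("nope", e)))
```

One caveat of the `tryCatch` variant is that a key stored with the value NULL cannot be told apart from a missing key, so `exists()` is the safer test when that distinction matters.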
