One Day Course in Cloud Computing and Analytics

June 9, 2009

We’ll be offering a one-day course introducing cloud computing and analytics in Chicago on June 22 and in San Mateo on July 14, 2009.

Information about the course can be found at blog.opendatagroup.com/courses

If you are currently between jobs, please send a cover letter indicating why you would like to attend the course and a resume to fee-waived-workshop@opendatagroup.com. Those selected can attend the course without charge. We have reserved five seats in the course for this purpose.


Three Common Mistakes in Analytic Projects

June 1, 2009

In this post, I’ll describe some of the most common mistakes that occur when managing analytic projects.

Mistake 1. Underestimating the time required to get the data. This is probably the most common mistake in modeling projects. Getting the data required for an analytic project usually requires a special request to the IT department, and any special request made to an IT department can take time. Usually, several meetings are required between the business owners of the analytic problem, the statisticians building the models, and the IT department in order to decide what data is required and whether it is available. Once there is agreement on what data is required, the special request to the IT department is made and the wait begins. Project managers are sometimes under the impression that good models can be built without data, just as statisticians are sometimes under the impression that modeling projects can be managed without a project plan.

Mistake 2. Not having a good plan for deploying the model. There are several phases in a modeling project. In one phase, data is acquired from the IT department and the model is built. A statistician is usually in charge of building the model. In the next phase, the model is deployed. This is the responsibility of the IT department. Deployment requires providing the model with the appropriate data, post-processing the scores produced by the model to compute the associated actions, and then integrating these actions into the required business processes. Deploying models is in many cases as complicated as, or more complicated than, building them, and it requires a plan. A good standards-compliant architecture can help here. It is often useful for the statistician to export the model as PMML (the Predictive Model Markup Language). The model can then be imported by the application used in the operational system.
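To make the deployment steps a little more concrete, here is a minimal sketch in Python of the three pieces described above: giving the model its input data, post-processing the scores into actions, and handing the actions to a business process. Everything in it is an assumption made for illustration: the field names, the toy scoring formula standing in for a real model, the score thresholds, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    customer_id: str
    score: float
    action: str


def score_record(record: dict) -> float:
    """Hypothetical stand-in for the deployed model: a toy risk score in [0, 1]."""
    raw = 0.1 + 0.5 * record["late_payments"] / 10 + 0.4 * record["utilization"]
    return max(0.0, min(1.0, raw))


def score_to_action(score: float) -> str:
    """Post-process a score into a business action (thresholds are illustrative)."""
    if score >= 0.8:
        return "decline"
    if score >= 0.5:
        return "manual_review"
    return "approve"


def run_deployment_step(records: list) -> list:
    """Score each record and attach the action the business process should take."""
    decisions = []
    for record in records:
        score = score_record(record)
        decisions.append(Decision(record["customer_id"], score, score_to_action(score)))
    return decisions


if __name__ == "__main__":
    sample = [
        {"customer_id": "A", "late_payments": 0, "utilization": 0.0},
        {"customer_id": "B", "late_payments": 4, "utilization": 0.75},
        {"customer_id": "C", "late_payments": 10, "utilization": 1.0},
    ]
    for decision in run_deployment_step(sample):
        print(decision.customer_id, decision.action)
```

Even in a toy like this, most of the code is about data and actions rather than the model itself, which is one reason deployment deserves its own plan.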

Mistake 3. Trying to build the perfect model. Another common mistake is trying to build the perfect statistical model. Usually, the impact of a model will be much higher if a model that is good enough is deployed and a process is then put in place that: i) reviews the effectiveness of the model frequently with the business owner of the problem; ii) refreshes the model on a regular basis with the most recent data; and iii) rebuilds the model on a periodic basis with the lessons learned from the reviews.


Analytic Infrastructure – Three Trends

May 11, 2009

This is a post about systems, applications, services and architectures for building and deploying analytics. Sometimes this is called analytic infrastructure. In this post, we look at several trends impacting analytic infrastructure.

Trend 1. Open source analytics has reached Main Street. R, which was first released in 1996, is now the most widely deployed open source system for statistical computing. A recent article in the New York Times estimated that over 250,000 individuals use R regularly. Dice News has created a video called “What’s Up with R” to inform job hunters using their services about R. In the language of Geoffrey A. Moore’s book Crossing the Chasm, R has reached “Main Street.”

Some companies still either ban the use of open source software or require an elaborate approval process before open source software can be used. Today, a company that does not allow the use of R puts itself at a competitive disadvantage.

Trend 2. The maturing of open, standards-based architectures for analytics. Many of the common applications used today to build statistical models are stand-alone applications designed to be used by a single statistician. It is usually a challenge to deploy the model produced by such an application into operational systems. Some applications can express statistical models as C++ or SQL, which makes deployment easier, but it can still be a challenge to transform the data into the format expected by the model.

The Predictive Model Markup Language (PMML) is an XML language for expressing statistical and data mining models that was developed to provide an application-independent and platform-independent mechanism for importing and exporting models. PMML has become the dominant standard for statistical and data mining models. Many applications now support PMML.

By using these applications, it is possible to build an open, modular, standards-based environment for analytics. With this type of open analytic environment, it is quicker and less labor-intensive to deploy new analytic models and to refresh currently deployed models.
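To give a feel for what an application-independent model format looks like, here is a small sketch in Python that reads a hand-written PMML fragment for a one-variable linear regression and scores a record with it, using only the standard library. The model, its field names, and its coefficients are made up for illustration, and the fragment has not been validated against the PMML schema. A production system would use a full PMML scoring engine rather than a few lines like these.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical PMML fragment: y = 1.5 + 2.0 * x
PMML_DOC = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header description="Hypothetical example model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="toy_linear">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_0"}


def load_linear_model(pmml_text):
    """Extract the intercept and coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to one input record (a dict of field values)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


if __name__ == "__main__":
    model = load_linear_model(PMML_DOC)
    print(score(model, {"x": 3.0}))
```

The point of the sketch is that the scoring side needs to know nothing about the application that built the model. It only needs the XML.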

Disclaimer: I’m one of the many people who have been involved in the development of the PMML standard.

Trend 3. Cloud-based data services. Over the next several years, cloud-based services will begin to impact analytics significantly. A later post in this series will show, for example, how simple it is to use R in a cloud. Although there are security, compliance and policy issues to work out before it becomes common to use clouds for analytics, I expect that these and related issues will all be worked out over the next several years.

Cloud-based services provide several advantages for analytics. Perhaps the most important is elastic capacity — if 25 processors are needed for one job for a single hour, then these can be used for just the single hour and no more. This ability of clouds to handle surge capacity is important for many groups that do analytics. With the appropriate surge capacity provided by clouds, modelers can be more productive, and this can be accomplished in many cases without requiring any capital expense. (Third party clouds provide computing capacity that is an operating and not a capital expense.)
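The arithmetic behind elastic capacity is simple enough to sketch in a few lines of Python. The price used here (10 cents per processor-hour) is a hypothetical figure chosen only to make the comparison easy to follow, not a quote from any provider, and the 30-day month is likewise an assumption.

```python
# Hypothetical on-demand price, in cents per processor-hour (an assumption).
PRICE_CENTS_PER_PROCESSOR_HOUR = 10


def on_demand_cost_cents(processors, hours, price_cents=PRICE_CENTS_PER_PROCESSOR_HOUR):
    """Cost of renting `processors` processors for `hours` hours, in cents."""
    return processors * hours * price_cents


# The surge job from the text: 25 processors for a single hour.
surge_cents = on_demand_cost_cents(processors=25, hours=1)

# For comparison: keeping the same 25 processors running for a 30-day month.
always_on_cents = on_demand_cost_cents(processors=25, hours=24 * 30)

if __name__ == "__main__":
    print(f"Surge job:          ${surge_cents / 100:.2f}")
    print(f"Always-on month:    ${always_on_cents / 100:.2f}")
    print(f"Ratio (always/surge): {always_on_cents // surge_cents}x")
```

Under these assumptions, the group pays only for the one hour it actually needs, and the bill is an operating expense rather than a capital one.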