R is ready for big data

Original URL: https://www.theregister.com/2012/06/03/big_data_r_statistical_analysis/

Take the open road to statistical analysis

Posted in Databases, 3rd June 2012 23:00 GMT

Statistical analysis has been around since mainframes were introduced to academia and corporations back in the 1960s.

But the great diversity of telemetry collected by systems today, the need to sift through it for insight and the growing popularity of open-source alternatives are transforming the R programming language for statistical analysis and visualisation. Its new nickname is Red Hat for stats.

Everybody loves R, particularly those selling big-data products such as data warehouses and Hadoop data munchers.

Part of the reason is that R is an open-source package that solicits input from a large and clever community of statisticians and quantitative analysts who are able to steer its development.

Alphabet soup

This was not the case for proprietary tools created by SAS Institute and SPSS at the dawn of the mainframe era, and their follow-ons in the distributed computing era.

Just as Linux can be thought of as an open-source analog to Unix, the R programming language borrows heavily from the S language.

S was created by John Chambers at Bell Labs in 1976 as a reaction to the pricey but well-respected SPSS and SAS tools that came out nearly a decade earlier.

S is very much a child of the VAX and Unix minicomputer era, while R is a product of the PC and Linux era.

The R language was created in 1996 by Ross Ihaka and Robert Gentleman, two stats professors from the University of Auckland in New Zealand who are still core members of the R development team. (Incidentally, so is Chambers, the creator of S, and it is no accident that some data crunching routines for S will run unchanged in the R environment.)

R can be thought of as a modern implementation of S. So can S-PLUS, created by a company called Insightful, which licensed S from Lucent Technologies in 2004 and was eaten by Tibco Software in 2008.

Come the revolution

Unlike S and to a certain extent S-PLUS, R is not just some code created in an ivory tower.

It is the product of a community of statisticians and coders which has created more than 2,500 plug-ins for chewing on various data sets and doing statistical analysis tuned specifically for particular data types or industries.

R is used by more than two million quantitative analysts worldwide, according to estimates made by Revolution Analytics, which was founded in 2007 to create a parallel implementation of R.

Since then, the company has taken an open-core approach to R, offering commercial support for the open-source package, while at the same time extending the R environment to run better on clusters of machines and in conjunction with Hadoop clusters.

To date, no one has commercialised PSPP, the open-source alternative to SPSS (which was acquired by IBM in July 2009), but it would not be surprising to see this happen at some point if PSPP matures.

Revolution Analytics has not exactly made the R community happy by peddling proprietary extensions to R in its R Enterprise distribution, after getting some seed money from Intel Capital in 2008 and $9m in venture money in 2009.

Since then, Revolution Analytics has parallelised the underlying R statistical engine so it runs better on multicore/multithreaded processors and across server clusters; added a NoSQL-like format called XDF to help parallelise data sets; and added support for native SAS file formats and conversion to XDF.

Most recently, it has tweaked its R implementation so that each node in a Hadoop cluster can run R analytics locally on data stored in the Hadoop Distributed File System and then aggregate the results of those calculations, much like MapReduce operations on unstructured data.
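The run-locally-then-aggregate pattern is easy to sketch. What follows is a minimal illustration in Python, not Revolution Analytics' code: each "node" boils its own block of data down to a tiny summary, and only those summaries are merged to give the mean and sample variance of the whole data set. The names `Partial`, `summarise`, `combine` and `finalise` are hypothetical, and the sum-of-squares formula is the simple textbook one rather than a numerically hardened version.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Partial:
    """Small summary that one node sends back instead of its raw data."""
    n: int
    total: float
    total_sq: float


def summarise(block):
    """Map step: reduce a node's local block to count, sum and sum of squares."""
    return Partial(len(block), sum(block), sum(x * x for x in block))


def combine(a, b):
    """Reduce step: merging two summaries is just component-wise addition."""
    return Partial(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def finalise(p):
    """Turn the merged summary into a mean and a sample variance."""
    mean = p.total / p.n
    variance = (p.total_sq - p.n * mean * mean) / (p.n - 1)
    return mean, variance


# Three lists stand in for data blocks held on three different nodes.
blocks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]

partials = [summarise(block) for block in blocks]
mean, variance = finalise(reduce(combine, partials))
print(mean, variance)
```

Because `combine` is associative, the summaries can be merged in any order, which is what lets the work spread across nodes without any one machine seeing all the raw data.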

Revolution Analytics has soaked up a lot of the oxygen in the R room for the past few years. But other companies are doing interesting things, integrating R tools with their own products and making life easier for analysts seeking answers in mountains of data.

Parallel universe

Seeking some kind of advantage over its rivals in the data warehousing space, Netezza opened up the Netezza software stack in February 2010.

Netezza is a maker of data warehousing appliances based on a heavily customised and parallelised version of the PostgreSQL database, which uses field programmable gate arrays (FPGAs) to boost its performance running on x86 clusters.

Netezza opened up its software development environment with a set of APIs that allow SAS and R algorithms to run in parallel on its warehouse appliances.

It offered similar hooks for Java, C++, Fortran and Python applications to reach into the data warehouse and use the FPGAs to extract data stored there, rather than going through the SQL database query language.

Seven months later, as it became clearer that big data was going to be big business, IBM snapped up privately held Netezza for a cool $1.7bn.

In October 2010, data warehouse maker Teradata added its own in-database analytics to its eponymous data warehouses with a package called TeradataR.

This turns the Teradata Warehouse Miner tool into a plug-in for the R console, exposing 44 different analytical functions in Teradata databases, as well as any stored procedures in the data warehouses, so they can be called from R programs. Another 20 functions let R work in the Teradata environment.

The idea is to stay within the R console and run the analytics in parallel on the database, instead of trying to suck information down into a workstation and running R locally.
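The difference between pulling data to the workstation and pushing the calculation to the database can be shown with a small sketch. It is written in Python and uses the standard library's `sqlite3` module as a stand-in for a data warehouse; it does not use TeradataR or any Teradata API, and the `sales` table and the function names are made up for illustration.

```python
import sqlite3


def make_warehouse():
    """Build a tiny in-memory table that stands in for a warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("north", 30.0),
         ("south", 5.0), ("south", 15.0), ("south", 70.0)],
    )
    return conn


def pull_then_compute(conn):
    """Workstation approach: fetch every row, then average on the client."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        count, total = totals.get(region, (0, 0.0))
        totals[region] = (count + 1, total + amount)
    means = {region: total / count for region, (count, total) in totals.items()}
    return means, len(rows)


def compute_in_database(conn):
    """In-database approach: the engine aggregates, only results come back."""
    rows = conn.execute(
        "SELECT region, AVG(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


conn = make_warehouse()
local_means, rows_pulled = pull_then_compute(conn)
db_means, rows_returned = compute_in_database(conn)
print(local_means == db_means, rows_pulled, rows_returned)
```

Both routes give the same answer, but the first moves every row across the wire while the second moves one row per region. On a five-row table that hardly matters; on a warehouse holding billions of rows it is the whole point.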

Oracle joins in

Even Oracle is getting in on the R act. In February, the company launched Advanced Analytics, a bridge between Oracle databases and the R analytical engine.

Advanced Analytics is Oracle's Data Mining add-on for its 11g R2 database. When R programmers want to run a statistical routine, they call the equivalent SQL function in the Data Mining toolbox and run that against the database.

If there is no such SQL function, then an embedded R engine spread across database nodes (if it is a cluster) runs the R routines, collects up summary data and presents it back to the R console as an answer.
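The dispatch logic described above, use a SQL function where one exists and fall back to an embedded engine where one does not, might look something like the following sketch. It is Python with `sqlite3` standing in for the database, and it is an assumption-laden illustration rather than Oracle's implementation: `SQL_EQUIVALENTS`, `FALLBACK_ROUTINES` and `run_routine` are hypothetical names, and a plain Python function plays the part of the R routine.

```python
import sqlite3
import statistics

# Hypothetical registry: routines that have an equivalent SQL aggregate.
SQL_EQUIVALENTS = {"mean": "AVG", "min": "MIN", "max": "MAX"}

# Hypothetical fallbacks: routines with no SQL equivalent, standing in for
# R code that an embedded engine would run next to the data.
FALLBACK_ROUTINES = {"median": statistics.median}


def run_routine(conn, name, table, column):
    """Return (result, route), where route records which path was taken.

    The table and column names are trusted here; a real tool would have to
    validate them before building SQL text.
    """
    if name in SQL_EQUIVALENTS:
        sql = f"SELECT {SQL_EQUIVALENTS[name]}({column}) FROM {table}"
        return conn.execute(sql).fetchone()[0], "sql"
    # No SQL equivalent: hand the column to the fallback routine and
    # return only its summary value to the caller.
    values = [row[0] for row in conn.execute(f"SELECT {column} FROM {table}")]
    return FALLBACK_ROUTINES[name](values), "embedded"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany("INSERT INTO readings VALUES (?)", [(1.0,), (2.0,), (3.0,), (10.0,)])

mean_result = run_routine(conn, "mean", "readings", "value")
median_result = run_routine(conn, "median", "readings", "value")
print(mean_result, median_result)
```

Either way the caller gets back a single summary value, so the analyst at the console does not need to know which route the request took.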

Oracle also ships something called the R Connector for Hadoop for its Big Data Appliance, a version of the Cloudera CDH3 Hadoop environment running on Oracle's Exa x86 cluster iron.

This connector lets an R console talk to the Hadoop Distributed File System and NoSQL databases running on the Big Data Appliance. ®