R is ready for big data

Take the open road to statistical analysis

High performance access to file storage

Statistical analysis has been around since mainframes were introduced to academia and corporations back in the 1960s.

But the great diversity of telemetry collected by systems today, the need to sift through it for insight and the growing popularity of open-source alternatives is transforming the R programming language for statistical analysis and visualisation. Its new nickname is Red Hat for stats.

Everybody loves R, particularly those selling big-data products such as data warehouses and Hadoop data munchers.

Part of the reason is that R is an open source package that solicits input from a large and clever community of statisticians and quantitative analysts who are able to steer its development.

Alphabet soup

This was not the case for proprietary tools created by SAS Institute and SPSS at the dawn of the mainframe era, and their follow-ons in the distributed computing era.

Just as Linux can be thought of as an open-source analog to Unix, the R programming language borrows heavily from the S language.

This was created by John Chambers at Bell Labs in 1976, as a reaction to the pricey but well respected SPSS and SAS tools that came out nearly a decade earlier.

S is very much a child of the VAX and Unix minicomputer era, while R is a product of the PC and Linux era.

The R language was created in 1996 by Ross Ihaka and Robert Gentleman, two stats professors from the University of Auckland in New Zealand who are still core members of the R development team. (Incidentally, so is Chambers, the creator of S, and it is no accident that some data crunching routines for S will run unchanged in the R environment.)

R can be thought of as a modern implementation of S. So can S-PLUS, created by a company called Insightful, which licensed S from Lucent Technologies in 2004 and was eaten by Tibco Software in 2008.

Come the revolution

Unlike S and to a certain extent S-PLUS, R is not just some code created in an ivory tower.

It is the product of a community of statisticians and coders which has created more than 2,500 plug-ins for chewing on various data sets and doing statistical analysis tuned specifically for particular data types or industries.

R is used by more than two million quantitative analysts worldwide, according to estimates made by Revolution Analytics, which was founded in 2007 to create a parallel implementation of R.

Since then, the company has taken an open-core approach to R, offering commercial support for the open-source package, while at the same time extending the R environment to run better on clusters of machines and in conjunction with Hadoop clusters.

To date, no one has commercialised the PSPP open-source alternative to SPSS (acquired by IBM in July 2009), but it would not be surprising to see this happen at some point, if PSSP matures.

Revolution Analytics has not exactly made the R community happy by peddling proprietary extensions to R in its R Enterprise distribution, after getting some seed money from Intel Capital in 2008 and $9m in venture money in 2009.

Since then, Revolution Analytics has parallelised the underlying R statistical engine so it runs better on multicore/multithreaded processors and across server clusters; added a NoSQL-like format called XDF to help parallelise data sets; and added support for native SAS file formats and conversion to XDF

Most recently it has tweaked its R implementation so each node in a Hadoop cluster can run R analytics locally on the Hadoop cluster on data stored in the Hadoop Distributed File System and then aggregate the results of those calculations, much like MapReduce operations on unstructured data.

Revolution Analytics has soaked up a lot of the oxygen in the R room for the past few years. But other companies are doing interesting things, integrating R tools with their own products and making life easier for analysts seeking answers in mountains of data.

Parallel universe

Seeking some kind of advantage over its rivals in the data warehousing space, Netezza opened up the Netezza software stack in February 2010.

Netezza is a maker of data warehousing appliances based on a heavily customised and parallelised version of the PostgreSQL database, which uses field programmable gate arrays (FPGAs) to boost its performance running on x86 clusters.

Netezza opened up its software development environment with a set of APIs that allow SAS and R algorithms to run in parallel on its warehouse appliances.

It also similarly offered hooks for Java, C++, Fortran, or Python applications to reach into the data warehouse and use the FPGAs to extract data stored in the warehouse rather than using the SQL database query language.

Seven months later, as it became clearer that big data was going to be big business, IBM snapped up privately held Netezza for a cool $1.7bn.

In October 2010, data warehouse maker Teradata added its own in-database analytics to its eponymous data warehouses with a package called TeradataR.

This turns the Teradata Warehouse Miner tool into a plug-in for the R console, allowing for 44 different analytical functions in Teradata databases, as well as any stored procedures in data warehouses to be exposed to R and called from R programs. There are another 20 functions that let R work in the Teradata environment.

The idea is to stay within the R console and run the analytics in parallel on the database, instead of trying to suck information down into a workstation and running R locally.

Oracle joins in

Even Oracle is getting in on the R act. In February, the company launched Advanced Analytics, a bridge between Oracle databases and the R analytical engine.

Advanced Analytics is Oracle's Data Mining add-on for its 11g R2 database. When R programmers want to run a statistical routine, they call the equivalent SQL function in the Data Mining toolbox and run that against the database.

If there is no such SQL function, then an embedded R engine spread across database nodes (if it is a cluster) runs the R routines, collects up summary data and presents it back to the R console as an answer.

Oracle also ships something called the R Connector for Hadoop for its Big Data Appliance, a version of the Cloudera CDH3 Hadoop environment running on Oracle's Exa x86 cluster iron.

This connector lets an R console talk to Hadoop Distributed File System and NoSQL databases running on the Big Data Appliance. ®

High performance access to file storage

More from The Register

next story
Audio fans, prepare yourself for the Second Coming ... of Blu-ray
High Fidelity Pure Audio – is this what your ears have been waiting for?
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
MtGox chief Karpelès refuses to come to US for g-men's grilling
Bitcoin baron says he needs another lawyer for FinCEN chat
Did a date calculation bug just cost hard-up Co-op Bank £110m?
And just when Brit banking org needs £400m to stay afloat
Zucker punched: Google gobbles Facebook-wooed Titan Aerospace
Up, up and away in my beautiful balloon flying broadband-bot
Apple DOMINATES the Valley, rakes in more profit than Google, HP, Intel, Cisco COMBINED
Cook & Co. also pay more taxes than those four worthies PLUS eBay and Oracle
It may be ILLEGAL to run Heartbleed health checks – IT lawyer
Do the right thing, earn up to 10 years in clink
France bans managers from contacting workers outside business hours
«Email? Mais non ... il est plus tard que six heures du soir!»
prev story


Securing web applications made simple and scalable
In this whitepaper learn how automated security testing can provide a simple and scalable way to protect your web applications.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
HP ArcSight ESM solution helps Finansbank
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Mobile application security study
Download this report to see the alarming realities regarding the sheer number of applications vulnerable to attack, as well as the most common and easily addressable vulnerability errors.