R is ready for big data

Take the open road to statistical analysis

Security for virtualized datacentres

Statistical analysis has been around since mainframes were introduced to academia and corporations back in the 1960s.

But the great diversity of telemetry collected by systems today, the need to sift through it for insight and the growing popularity of open-source alternatives is transforming the R programming language for statistical analysis and visualisation. Its new nickname is Red Hat for stats.

Everybody loves R, particularly those selling big-data products such as data warehouses and Hadoop data munchers.

Part of the reason is that R is an open source package that solicits input from a large and clever community of statisticians and quantitative analysts who are able to steer its development.

Alphabet soup

This was not the case for proprietary tools created by SAS Institute and SPSS at the dawn of the mainframe era, and their follow-ons in the distributed computing era.

Just as Linux can be thought of as an open-source analog to Unix, the R programming language borrows heavily from the S language.

This was created by John Chambers at Bell Labs in 1976, as a reaction to the pricey but well respected SPSS and SAS tools that came out nearly a decade earlier.

S is very much a child of the VAX and Unix minicomputer era, while R is a product of the PC and Linux era.

The R language was created in 1996 by Ross Ihaka and Robert Gentleman, two stats professors from the University of Auckland in New Zealand who are still core members of the R development team. (Incidentally, so is Chambers, the creator of S, and it is no accident that some data crunching routines for S will run unchanged in the R environment.)

R can be thought of as a modern implementation of S. So can S-PLUS, created by a company called Insightful, which licensed S from Lucent Technologies in 2004 and was eaten by Tibco Software in 2008.

Come the revolution

Unlike S and to a certain extent S-PLUS, R is not just some code created in an ivory tower.

It is the product of a community of statisticians and coders which has created more than 2,500 plug-ins for chewing on various data sets and doing statistical analysis tuned specifically for particular data types or industries.

R is used by more than two million quantitative analysts worldwide, according to estimates made by Revolution Analytics, which was founded in 2007 to create a parallel implementation of R.

Since then, the company has taken an open-core approach to R, offering commercial support for the open-source package, while at the same time extending the R environment to run better on clusters of machines and in conjunction with Hadoop clusters.

To date, no one has commercialised the PSPP open-source alternative to SPSS (acquired by IBM in July 2009), but it would not be surprising to see this happen at some point, if PSSP matures.

Revolution Analytics has not exactly made the R community happy by peddling proprietary extensions to R in its R Enterprise distribution, after getting some seed money from Intel Capital in 2008 and $9m in venture money in 2009.

Since then, Revolution Analytics has parallelised the underlying R statistical engine so it runs better on multicore/multithreaded processors and across server clusters; added a NoSQL-like format called XDF to help parallelise data sets; and added support for native SAS file formats and conversion to XDF

Most recently it has tweaked its R implementation so each node in a Hadoop cluster can run R analytics locally on the Hadoop cluster on data stored in the Hadoop Distributed File System and then aggregate the results of those calculations, much like MapReduce operations on unstructured data.

Revolution Analytics has soaked up a lot of the oxygen in the R room for the past few years. But other companies are doing interesting things, integrating R tools with their own products and making life easier for analysts seeking answers in mountains of data.

Parallel universe

Seeking some kind of advantage over its rivals in the data warehousing space, Netezza opened up the Netezza software stack in February 2010.

Netezza is a maker of data warehousing appliances based on a heavily customised and parallelised version of the PostgreSQL database, which uses field programmable gate arrays (FPGAs) to boost its performance running on x86 clusters.

Netezza opened up its software development environment with a set of APIs that allow SAS and R algorithms to run in parallel on its warehouse appliances.

It also similarly offered hooks for Java, C++, Fortran, or Python applications to reach into the data warehouse and use the FPGAs to extract data stored in the warehouse rather than using the SQL database query language.

Seven months later, as it became clearer that big data was going to be big business, IBM snapped up privately held Netezza for a cool $1.7bn.

In October 2010, data warehouse maker Teradata added its own in-database analytics to its eponymous data warehouses with a package called TeradataR.

This turns the Teradata Warehouse Miner tool into a plug-in for the R console, allowing for 44 different analytical functions in Teradata databases, as well as any stored procedures in data warehouses to be exposed to R and called from R programs. There are another 20 functions that let R work in the Teradata environment.

The idea is to stay within the R console and run the analytics in parallel on the database, instead of trying to suck information down into a workstation and running R locally.

Oracle joins in

Even Oracle is getting in on the R act. In February, the company launched Advanced Analytics, a bridge between Oracle databases and the R analytical engine.

Advanced Analytics is Oracle's Data Mining add-on for its 11g R2 database. When R programmers want to run a statistical routine, they call the equivalent SQL function in the Data Mining toolbox and run that against the database.

If there is no such SQL function, then an embedded R engine spread across database nodes (if it is a cluster) runs the R routines, collects up summary data and presents it back to the R console as an answer.

Oracle also ships something called the R Connector for Hadoop for its Big Data Appliance, a version of the Cloudera CDH3 Hadoop environment running on Oracle's Exa x86 cluster iron.

This connector lets an R console talk to Hadoop Distributed File System and NoSQL databases running on the Big Data Appliance. ®

Security and trust: The backbone of doing business over the internet

More from The Register

next story
Phones 4u slips into administration after EE cuts ties with Brit mobe retailer
More than 5,500 jobs could be axed if rescue mission fails
Driving with an Apple Watch could land you with a £100 FINE
Bad news for tech-addicted fanbois behind the wheel
Phones 4u website DIES as wounded mobe retailer struggles to stay above water
Founder blames 'ruthless network partners' for implosion
Sony says year's losses will be FOUR TIMES DEEPER than thought
Losses of more than $2 BILLION loom over troubled Japanese corp
Radio hams can encrypt, in emergencies, says Ofcom
Consultation promises new spectrum and hints at relaxed licence conditions
Why Oracle CEO Larry Ellison had to go ... Except he hasn't
Silicon Valley's veteran seadog in piratical Putin impression
Big Content Australia just blew a big hole in its credibility
AHEDA's research on average content prices did not expose methodology, so appears less than rigourous
Bono: Apple will sort out monetising music where the labels failed
Remastered so hard it would be difficult or impossible to master it again
prev story


Secure remote control for conventional and virtual desktops
Balancing user privacy and privileged access, in accordance with compliance frameworks and legislation. Evaluating any potential remote control choice.
WIN a very cool portable ZX Spectrum
Win a one-off portable Spectrum built by legendary hardware hacker Ben Heck
Storage capacity and performance optimization at Mizuno USA
Mizuno USA turn to Tegile storage technology to solve both their SAN and backup issues.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
The next step in data security
With recent increased privacy concerns and computers becoming more powerful, the chance of hackers being able to crack smaller-sized RSA keys increases.