R is ready for big data

Take the open road to statistical analysis

Statistical analysis has been around since mainframes were introduced to academia and corporations back in the 1960s.

But the great diversity of telemetry collected by systems today, the need to sift through it for insight, and the growing popularity of open-source alternatives are transforming the R programming language for statistical analysis and visualisation. Its new nickname is Red Hat for stats.

Everybody loves R, particularly those selling big-data products such as data warehouses and Hadoop data munchers.

Part of the reason is that R is an open-source package that solicits input from a large and clever community of statisticians and quantitative analysts, who are able to steer its development.

Alphabet soup

This was not the case for proprietary tools created by SAS Institute and SPSS at the dawn of the mainframe era, and their follow-ons in the distributed computing era.

Just as Linux can be thought of as an open-source analog to Unix, the R programming language borrows heavily from the S language.

This was created by John Chambers at Bell Labs in 1976, as a reaction to the pricey but well respected SPSS and SAS tools that came out nearly a decade earlier.

S is very much a child of the VAX and Unix minicomputer era, while R is a product of the PC and Linux era.

The R language was created in 1996 by Ross Ihaka and Robert Gentleman, two stats professors from the University of Auckland in New Zealand who are still core members of the R development team. (Incidentally, so is Chambers, the creator of S, and it is no accident that some data crunching routines for S will run unchanged in the R environment.)

R can be thought of as a modern implementation of S. So can S-PLUS, created by a company called Insightful, which licensed S from Lucent Technologies in 2004 and was eaten by Tibco Software in 2008.

Come the revolution

Unlike S and to a certain extent S-PLUS, R is not just some code created in an ivory tower.

It is the product of a community of statisticians and coders which has created more than 2,500 plug-ins for chewing on various data sets and doing statistical analysis tuned specifically for particular data types or industries.

R is used by more than two million quantitative analysts worldwide, according to estimates made by Revolution Analytics, which was founded in 2007 to create a parallel implementation of R.

Since then, the company has taken an open-core approach to R, offering commercial support for the open-source package, while at the same time extending the R environment to run better on clusters of machines and in conjunction with Hadoop clusters.

To date, no one has commercialised the PSPP open-source alternative to SPSS (acquired by IBM in July 2009), but it would not be surprising to see this happen at some point, if PSPP matures.

Revolution Analytics has not exactly made the R community happy by peddling proprietary extensions to R in its R Enterprise distribution, after getting some seed money from Intel Capital in 2008 and $9m in venture money in 2009.

Since then, Revolution Analytics has parallelised the underlying R statistical engine so it runs better on multicore/multithreaded processors and across server clusters; added a NoSQL-like format called XDF to help parallelise data sets; and added support for native SAS file formats and conversion to XDF.

Most recently, it has tweaked its R implementation so each node in a Hadoop cluster can run R analytics locally on data stored in the Hadoop Distributed File System and then aggregate the results of those calculations, much like MapReduce operations on unstructured data.
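
The compute-locally-then-aggregate pattern can be illustrated with a minimal Python sketch. This is not Revolution Analytics' code: the function names `local_summary` and `combine` and the sample partitions are hypothetical, and plain lists stand in for blocks of data held on separate Hadoop nodes.

```python
from math import sqrt

def local_summary(values):
    """Per-node step: reduce a local block to (count, sum, sum of squares)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return n, total, total_sq

def combine(summaries):
    """Aggregation step: merge per-node summaries into a global mean
    and sample standard deviation."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, sqrt(variance)

# Hypothetical data blocks, one per node.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, std = combine([local_summary(p) for p in partitions])
```

Only three numbers per node cross the network, whatever the size of each block. The sum-of-squares formula is chosen here for brevity; it can lose precision on large or badly scaled data, so a production system would likely use a more stable merging scheme.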

Revolution Analytics has soaked up a lot of the oxygen in the R room for the past few years. But other companies are doing interesting things, integrating R tools with their own products and making life easier for analysts seeking answers in mountains of data.

Parallel universe

Seeking some kind of advantage over its rivals in the data warehousing space, Netezza opened up the Netezza software stack in February 2010.

Netezza is a maker of data warehousing appliances based on a heavily customised and parallelised version of the PostgreSQL database, which uses field programmable gate arrays (FPGAs) to boost its performance running on x86 clusters.

Netezza opened up its software development environment with a set of APIs that allow SAS and R algorithms to run in parallel on its warehouse appliances.

It also offered similar hooks for Java, C++, Fortran, and Python applications to reach into the data warehouse and use the FPGAs to extract data stored there, rather than going through the SQL database query language.

Seven months later, as it became clearer that big data was going to be big business, IBM snapped up privately held Netezza for a cool $1.7bn.

In October 2010, data warehouse maker Teradata added its own in-database analytics to its eponymous data warehouses with a package called TeradataR.

This turns the Teradata Warehouse Miner tool into a plug-in for the R console, exposing 44 different analytical functions in Teradata databases, as well as any stored procedures in data warehouses, so they can be called from R programs. Another 20 functions let R work in the Teradata environment.

The idea is to stay within the R console and run the analytics in parallel on the database, instead of trying to suck information down into a workstation and running R locally.
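
A rough sketch of that trade-off, written in Python rather than R: an in-memory SQLite database stands in for the warehouse, and the `sales` table and both function names are hypothetical. None of this uses TeradataR or any Teradata interface; it only contrasts pulling rows down to the client with asking the database to do the arithmetic.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse
# (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 30.0), ("south", 5.0), ("south", 15.0)],
)

def mean_by_region_client_side(conn):
    """Pull every row to the client, then compute the means locally."""
    sums, counts = {}, {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        sums[region] = sums.get(region, 0.0) + amount
        counts[region] = counts.get(region, 0) + 1
    return {r: sums[r] / counts[r] for r in sums}

def mean_by_region_in_database(conn):
    """Ask the database to aggregate; only one row per region comes back."""
    query = "SELECT region, AVG(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())

client = mean_by_region_client_side(conn)
pushed = mean_by_region_in_database(conn)
```

Both calls give the same answer, but the second moves one row per region instead of the whole table, which is the point of running analytics inside the database.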

Oracle joins in

Even Oracle is getting in on the R act. In February, the company launched Advanced Analytics, a bridge between Oracle databases and the R analytical engine.

Advanced Analytics is Oracle's Data Mining add-on for its 11g R2 database. When R programmers want to run a statistical routine, they call the equivalent SQL function in the Data Mining toolbox and run that against the database.

If there is no such SQL function, then an embedded R engine spread across database nodes (if it is a cluster) runs the R routines, collects up summary data and presents it back to the R console as an answer.
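
The routing described above (use the SQL equivalent if one exists, otherwise fall back to an embedded engine) might be sketched as follows. This is a Python illustration under stated assumptions, not Oracle's implementation: the `SQL_EQUIVALENTS` and `FALLBACK_ROUTINES` tables and the `plan` function are all hypothetical.

```python
from statistics import median

# Hypothetical registry of routines that have a native SQL equivalent.
SQL_EQUIVALENTS = {"mean": "AVG", "max": "MAX", "min": "MIN"}

# Hypothetical routines with no SQL equivalent, to be run by an
# embedded engine instead.
FALLBACK_ROUTINES = {"median": median}

def plan(routine, column, table):
    """Return ('sql', query) when a SQL equivalent exists,
    otherwise ('embedded', function)."""
    if routine in SQL_EQUIVALENTS:
        sql_name = SQL_EQUIVALENTS[routine]
        return "sql", f"SELECT {sql_name}({column}) FROM {table}"
    if routine in FALLBACK_ROUTINES:
        return "embedded", FALLBACK_ROUTINES[routine]
    raise KeyError(f"unknown routine: {routine}")

sql_plan = plan("mean", "amount", "sales")
kind, routine_fn = plan("median", "amount", "sales")
```

The sketch stops at choosing a route. Spreading the fallback across database nodes and collecting summary data would sit behind the `embedded` branch.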

Oracle also ships something called the R Connector for Hadoop for its Big Data Appliance, a version of the Cloudera CDH3 Hadoop environment running on Oracle's Exa x86 cluster iron.

This connector lets an R console talk to the Hadoop Distributed File System and NoSQL databases running on the Big Data Appliance. ®
