Revolution lets R do stats on big data
Scalability boost, too
If you've got big data, then R will soon be able to chew on it and spit out some answers.
Revolution Analytics was formed in May to become the 'Red Hat for stats', funding development for the open source R statistical programming language and offering a commercially supported, open core variant for enterprise customers with some of the bells and whistles that are missing from the open source R package. At the time of the launch, the company said that it was working to allow R to scale better within a server and across servers and to give it extensions to analyze big data sets commonly stored in NoSQL, Hadoop and other formats.
Today, Revolution Analytics will preview Revolution R Enterprise V4, its future release which is in beta now and which should ship by the end of August, according to Jeff Erhardt, chief operating officer at the company. With the V4 update, R Enterprise gets two things.
First, the guts of the R code have been changed to understand threading better and scale across clusters if need be, not just try to work on a couple of threads and the main memory available on a single system. R Enterprise V4 has been tweaked to allow calculations normally undertaken on a single workstation in R (and usually not across very many threads) to be distributed across threads within a CPU core, multiple CPUs within a system, or multiple systems in a cluster.
David Champagne, who was the principal architect and engineer at SPSS (now owned by IBM) and is now chief technology officer at Revolution Analytics, says that on a single machine the scalability tweaks are based on the company's own threading code, not OpenMP or some other threading library. For distributed computing across a network of machines, the tweaked R relies on remote procedure calls (RPC) to communicate between the nodes as they chew on data. "We are looking at possibly changing this in the future to use something like MPI," says Champagne. MPI, of course, is the Message Passing Interface protocol that parallel supercomputers use to pass data and distribute HPC work across clusters.
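To make the idea of spreading one calculation across workers concrete, here is a minimal sketch in Python. It is not Revolution's threading or RPC code, which the company has not shown; the function names (`partial_stats`, `split`, `parallel_mean_var`) and the chunk size are assumptions made up for illustration. Each worker reduces its own slice of the data to a few partial sums, and the partial sums are then combined into a mean and variance.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Return (count, sum, sum of squares) for one chunk of numbers."""
    n = len(chunk)
    s = sum(chunk)
    ss = sum(x * x for x in chunk)
    return n, s, ss


def split(data, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def parallel_mean_var(data, chunk_size=1000, workers=4):
    """Mean and population variance built from per-chunk partial sums.

    Hypothetical helper: assumes data is a non-empty list of numbers.
    """
    chunks = split(data, chunk_size)
    # Each chunk is reduced independently, so the order in which the
    # workers finish does not matter.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_stats, chunks))
    # Combine step: only three numbers per chunk travel back.
    n = sum(p[0] for p in parts)
    s = sum(p[1] for p in parts)
    ss = sum(p[2] for p in parts)
    mean = s / n
    var = ss / n - mean * mean
    return mean, var


if __name__ == "__main__":
    print(parallel_mean_var(list(range(10)), chunk_size=3, workers=2))
```

Because only the small partial results are combined, the same pattern applies whether the workers are threads in one box or nodes reached over RPC or MPI. In this sketch the thread pool only illustrates the structure; in standard Python, threads would not speed up pure-Python arithmetic.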
The other big change coming with R Enterprise V4 is a binary big data format called XDF, which Erhardt says is loosely based on NoSQL. (Which is funny, because NoSQL is itself a pretty loose term for a whole bunch of non-relational data stores.) The important thing is that the XDF format for R Enterprise lets users chunk their data and provides very high-speed access to arbitrary rows, columns, and blocks in the store. R Enterprise V4 has tools to pull data into the new XDF format and can also then spread calculations across multiple threads, cores, CPUs, and machines to scale up the performance of analysis on big data sets.
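Revolution has not published how XDF is laid out on disk, so the following Python sketch shows only the general principle behind fast access to arbitrary rows and columns in a binary store. The file layout, the names `write_store` and `read_block`, and the use of 64-bit floats are all assumptions for illustration. With fixed-width values stored one column after another, the position of any block can be computed and read with a single seek, without scanning the rest of the file.

```python
import struct

HEADER = struct.Struct("<QQ")  # number of rows, number of columns
CELL = struct.Struct("<d")     # one 64-bit float per value


def write_store(path, columns):
    """Write equal-length columns of floats, one column after another."""
    nrows = len(columns[0])
    with open(path, "wb") as f:
        f.write(HEADER.pack(nrows, len(columns)))
        for col in columns:
            if len(col) != nrows:
                raise ValueError("all columns must have the same length")
            f.write(struct.pack(f"<{nrows}d", *col))


def read_block(path, col, start, stop):
    """Read rows [start, stop) of one column without scanning the rest."""
    with open(path, "rb") as f:
        nrows, ncols = HEADER.unpack(f.read(HEADER.size))
        if not (0 <= col < ncols and 0 <= start <= stop <= nrows):
            raise IndexError("block is outside the store")
        # Fixed-width cells make the byte offset a simple calculation.
        f.seek(HEADER.size + (col * nrows + start) * CELL.size)
        count = stop - start
        return list(struct.unpack(f"<{count}d", f.read(count * CELL.size)))
```

A worker that needs only rows 1 and 2 of the second column would call `read_block(path, 1, 1, 3)`, which is the kind of block-level access that lets separate threads or machines each pull their own chunk.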
The new XDF data store and scalability enhancements will be in a priced feature called Revo Scale R, which is an add-on module for R Enterprise V4. Customers who have bought R Enterprise V3.X releases and who are on current maintenance contracts will get the upgrade to V4 as well as the Revo Scale R module for free, says Erhardt. New customers will have to pay an incremental fee for the new big data and scalability enhancements.
Revolution Analytics is a bit vague about pricing, but says that for a single user working at a workstation, R Enterprise runs a few thousand dollars; for a server with a reasonable number of cores and sockets, it's on the order of $25,000 for a license. R Enterprise runs on Microsoft Windows Server 2003 and 2008 and on Red Hat Enterprise Linux 5. The Revo Scale R add-on is initially available only for Windows platforms, but will be available for RHEL soon, probably early in the fourth quarter, according to Erhardt.
Since the relaunch of the company in May - Revolution used to be a consultancy before it became an R distro - the company has boosted its customer base by 60 per cent, to 120. Quantitative finance and big pharma were always strong suits for the open source R language, but now Erhardt says companies in retail, telecomms, media and entertainment, and information services are all coming to Revolution Analytics to talk about R Enterprise and the extensions they are looking for.
One of the things that Revolution Analytics is cooking up is a web services platform, which will allow the part of R that analysts use to create algorithms and do analysis to be physically separated from the machines where the calculations are run. The idea is to allow for heavy calculations to be deployed to cloudy infrastructure. And because a lot of quants have built their models in Excel spreadsheets, the company has already demonstrated the ability to have R-based analytics executed from buttons in Excel, with the calculations on the data stored in spreadsheets done on a cloud of machines, where the math runs a lot faster. ®