Revolution Analytics paints R stats Azure blue
Gooses performance, spans HPC clusters with 6.0 update
Revolution Analytics, aka "Red Hat for stats" – which commercialized the open source R programming language and statistical analysis tool – has now tweaked its R Enterprise stack and pushed out a 6.0 release.
The new R Enterprise 6.0 is based on the R 2.14.2 engine, which is the latest stable release of the open source code, according to David Smith, head of marketing at the company. This code was released on February 29, with the 2.15.0 update just coming out on May 30 and not quite ready for inclusion in R Enterprise. You can see the full release notes for R 2.14.2 here.
The big new feature this edition of the R engine gives to users is a byte compiler that has been added to the engine. Similar to Java byte codes for a Java virtual machine, the byte compiler compiles the interpreted R code down to an intermediate stage before it executes in the R engine, which can speed up the operations by the R interpreter by around 30 per cent, according to Smith. This byte compiler for the R interpreter has no effect whatsoever on any number-crunching that the R engine needs to do, since this is not done by the interpreter but by another part of the engine. So obviously the performance improvements you will see from the new R engine will depend on the nature of the statistical algorithms you are running.
The update also has support for Generalized Linear Models, or GLMs, in stat speak. These include Logistic (Binomial) Poisson, Gamma, and Tweedie models, which are all supported with a high-performance C++ implementation, according to Smith.
A new feature of the R Enterprise 6.0 is integration with IBM's new Platform LSF V8.3 scheduler for HPC grids, which allows for R routines to be parallelized and run on a cluster of x86 iron. Put the two together – GLMs and HPC clusters – and you can get a significant speedup in many cases, according to Smith.
In the case of an insurance company that was doing "tweedie distribution" analysis against 30 million claims using the SAS stats package on a big SMP server, the job took eight hours to run. During beta tests, this customer fired up R Enterprise on an eight-node x86 cluster, used Platform LSF to dispatch work to the nodes from a workstation running R Enterprise and with the nodes running R Enterprise as well, and the job finished in 10 minutes. While the shortening of the time to complete the job is important, what is perhaps more important is the ability to iterate models quickly and improve them because the job runs so much faster. You need to be running Red Hat Enterprise Linux and R Enterprise on server nodes if you want Platform LSF to dispatch work in parallel to them.
The 6.0 release also includes the ability to read SAS and SPSS native file formats directly as well as sucking in raw ASCII text data and information sucked out of relational databases using ODBC to have it analyzed. In the past, R Enterprise had to convert this data to its own XDF NoSQL-like data store, and on data that is constantly changing, this reformatting is a pain in the neck. Now, you can just use the native data sets and, perhaps more importantly, not have to worry about having a license for SAS or SPSS if you have moved off those platforms to open-core R Enterprise tools.
Revolution Analytics already supported the running of its R stack inside of Amazon's EC2 compute clouds, and Smith tells El Reg that the company has "quite a few" customers that run R Enterprise in the cloud, and that all of the proof of concepts that Revolution Analytics does with customers run on EC2 as well.
But not everyone uses EC2. Some people use Microsoft's Azure cloud, and starting with R Enterprise 6.0, you can now fire up R instances on the Azure cloud. At the moment, Revolution Analytics is only supporting the bursting features of Azure, which allows you to dispatch work from inside your firewall to the Microsoft cloud. You cannot run R Enterprise in a standalone fashion on Azure, and you have to have at least one server node running Microsoft's Windows HPC Server to dispatch R work to Azure. You can have more nodes than that in the local cluster, of course, but you need at least one. And this bursting function, which has been in beta testing for the past four months, does not work with either releases of R Enterprise.
R Enterprise comes in two flavors: workstation and server. A workstation edition which is designed for a single user on a single workstation PC costs $1,000 per machine per year for a license. The server edition, which can be used by an unlimited number of end users firing work at the cluster or cloud, costs $30,000 per year for an eight-core x86 server. ®