Software

Dirty data, flogged cores: YES, Microsoft SQL Server R Services has its positives

The language isn't the problem, it's you and your PC

Kid on computer, photo via Shutterstock

The R language has enjoyed a great reputation in statistical computing and graphics for decades. However, it is also known as something for statisticians. Born around the time of Java, PHP and Python, R lags behind all three by a long chalk on the TIOBE rankings.

Yet Microsoft spotted an opportunity in this era of analytics and big data to breach that reputation last year with the introduction in SQL Server 2016 of SQL Server R Services.

The R language is a domain-specific language that specialises in handling data, doing statistics and even making machine learning relatively painless. Microsoft's inclusion of R in SQL Server makes it easier than ever to bring this analytical power to your tabular data.

What makes R special is its ability to handle data easily and to connect to databases with little trouble at all. This lets anyone with a basic understanding of programming begin statistical programming quite easily. Sure, it has its quirks, but then every language does.

It also has a large and ever-growing collection of packages that are easy to install and maintain (on a small scale).

But R is not perfect by any means. And that's the challenge. Sure, it's in SQL Server, but that's not the end of your problems. Moreover, it's not so much the technology per se that's an issue, but rather the kit and caboodle that accompanies R for you and your system.

The language itself is very memory hungry and can easily eat all the available memory on your machine with relatively small datasets. R stores all objects in memory, there's no disk cache without using extra external packages so it's very common to see an error message: "Cannot allocate vector of size..." when the host cannot allocate a big enough contiguous memory block to the process.

You can get round the problem to some extent by splitting your input data, but this will slow down your R process even more. Did I say slow it down more? Being an interpreted language, R was never going to be as fast as C. Tests I've seen suggest that R can be as much as 17 times slower than Python.

If you do want to analyse big data sets, you will need a 64-bit machine with a lot of RAM but you might not get a lot more performance with a multicore server.

The base of R is single threaded, which means on a four-core machine you should only see a 25 per cent of the CPU used. Some of the libraries, though, are multithreaded so your CPU usage mileage will vary.

Here's the real rub: running R Services in SQL Server 2016 is running analysis on your transactional databases. That's your live database, your R code is running inside your production database, eating the CPU cycles and disk access, slowing down your expensive SQL server.

You could use a second server to run R, but then you've got the potential network bottleneck of moving the data back and forth between the machines.

That's only one of the problems. The data inside a transactional database is not designed for analysis; it's likely to be dirty, inconsistent and full of errors. Anyone familiar with data warehousing will know that the process of moving data from a transactional source to the data warehouse will take about 70 per cent of the programming effort. It's not easy and it can be frustrating.

Yes, the act of adding R services to SQL server Microsoft is letting users run statistics on their data, but it's in all likelihood probably dirty data which may be full of inconsistencies and even errors.

In a traditional data warehouse, data will have been cleaned and errors removed to give "one version of the truth" before ending up in an online analytical cube that can be queried very quickly, have machine learning models applied to it and produce meaningful visualisations.

With R in the hands of SQL Server pilots, you risk cutting out the data cleansing and running queries on dirty data, producing flawed results, unless you have the processes in place first to clean the data.

There does exist an analytical language for querying OLAP cubes, the little understood Multidimensional Expressions (MDX) query language for OLAP data. That is designed to do what SQL can't understand – multidimensional databases.

However, MDX is hard to understand and you need to go to the trouble of building a cube in the first place and while it does provide a basic set of statistical functions its provides nothing as clever as machine learning.

If only R services could run against these cubes (inside Microsoft's Analysis Services), advanced machine learning algorithms could be applied to very clean data, specially prepared to give one version of the truth.

Well, it turns out it can be done using a package called olapR that lets you run MDX queries and use the data in R. It's relatively new but it could finally offer the best of both worlds, MDX queries to very clean data with advanced statistical analysis!

Another issue for newbies will be the need to get your head around the practicalities of R being a domain-specific language. You will almost certainly need an understanding of statistics to get meaningful answers out of your code. Never has the phrase garbage in/garbage out been more relevant. There are a large number of statistical methods and formula, each designed to be used in a certain way and will work best on certain types of data sets.

Knowing what method to use on what data comes from experience and a deep understanding of statistical methods.

Oh, and what of the actual results of your calculations? Is the result meaningful in the real world or just an artefact of a baldy collected dataset? That's another set of issues.

Let's not be too harsh on R. Before this language burst on to the scene, if you wanted to do statistics most likely you would be pointed to SPPS, a general-purpose stats package now marketed by IBM that generations of Social Science students have grown up with since 1968.

The problem was that SPSS was overly complicated, slow, and – more importantly – expensive. SPSS still has an active life as part of IBM's Business Analytics but – again – that's not cheap. Other commercial alternatives do exist – the SAS Statistical Analysis for instance – and you will also find some "free" alternatives such as JASP for instance. The problem is, for programmers they all have a steep learning curve and probably a GUI that is just too complicated. On the other hand, because of the expensive licence you will get a lot of technical support from IBM or SAS, which you won't with R or Python.

If you are a programmer, you could just jump straight into Python. It provides a friendly programming style that's easy to read (although some would argue that using white spaces for indentation is the work of the Devil) and looks like most of the languages you've probably used. Best of all, there are a large number of statistical and machine-learning packages available and ready to use – Numpy/Scipy, pymc and Pandas spring to mind.

The problem – yes, of course there had to be a problem – is Python is a general-purpose programming language. Python's libraries make it capable of doing an excellent job of statistics and machine learning – and many programmers have had great success using it this way – but it's just not built to do this particular job. Python's syntax hasn't been built for handing data for analysis and the sad truth is that the most up-to-date algorithms will be published in the R language. As a general-purpose language, Python does have the advantage that it's a lot easier to embed your analytics into a web service or standalone programs.

It's no surprise then that Microsoft identified the rise of R, first released in 1993, as something good for SQL server 2016. And, to a great extent, it's added R in a way consistent with Redmond's history of bringing complicated technologies to a wider audience.

Microsoft's implementation lets programmers embed R code directly into stored procedures, which will run inside the server. Your application can then call the procedure directly and access the answer from a table of results. The code running in the stored procedure has access to the full R language and R libraries, which means you can build advanced machine learning directly inside your SQL server.

For all the criticism of the R language – it's hard to understand, slow and a memory hog – it remains the language of choice for numerical data analysis. R services in SQL server offers the potential for complex analysis on data in a way easier than every before.

Just know what you're getting into first. ®