Big Data's big issue: Where are all the data scientists coming from?
This personnel gap isn't just a job-title change
Analysis Plug “data scientist” into Google and it is clear the job title has finally come of age and, suddenly, there is a huge skills shortage.
An oft-quoted source on this shortage is a McKinsey Global Institute study, which predicts a talent gap of 140,000 to 190,000 people by 2018 in the US alone. I am always sceptical of IT projections more than 18 months ahead (let alone six years) but I am convinced there is currently a huge skills shortage that is not going away in the next 17 months and 29 days.
So, what is a data scientist? My favourite description comes from Twitter: “Yeah, so I'm actually a data scientist. I just do this barista thing in between gigs.” More cynically: “A data scientist is just an analyst who lives in California.”
Possibly more accurate is that a data scientist (DS) is “a better software engineer than any statistician and a better statistician than any software engineer”. In other words, an important part of the job is to be able to design novel analytical algorithms for specific sets of data and then be able to implement that algorithm in the appropriate computer language.
Data scientists excel at analysing data, particularly large amounts of data that do not fit easily into tabular structures: so-called "Big Data".
For example, you should be able to point a data scientist at a web log and say: “Find the different patterns of behaviour in our users.” Or think about oil rigs for a moment. Breaking a drill bit during DIY work is irritating; in the middle of the North Sea it is annoying and very, very expensive. But if you collect enough sensor data (such as temperature, vibrations and RPM) you eventually have data for both normal running and breakages. You then point a data scientist at the data and say: “Build a system that predicts breakages before they happen.”
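As a rough illustration of the oil-rig brief, here is a minimal sketch in Python. It is not a real predictive model: it swaps in a much simpler stand-in technique, learning a per-sensor baseline from normal-running data and raising a warning when a new reading strays too far from it. The sensor values, the function names and the threshold of three standard deviations are all assumptions made up for this example.

```python
from statistics import mean, stdev

# Hypothetical readings from normal running: (temperature, vibration, RPM).
# Every value here is invented for illustration.
NORMAL_RUNS = [
    (70.0, 2.0, 120.0),
    (71.0, 2.1, 121.0),
    (69.5, 1.9, 119.0),
    (70.5, 2.2, 120.5),
    (70.2, 2.0, 120.2),
]


def fit_baseline(readings):
    """Return a (mean, standard deviation) pair for each sensor."""
    columns = list(zip(*readings))
    return [(mean(col), stdev(col)) for col in columns]


def anomaly_score(reading, baseline):
    """Largest absolute z-score across the sensors in one reading.

    Assumes every sensor varied at least slightly in the baseline data,
    so no standard deviation is zero.
    """
    return max(
        abs(value - mu) / sigma
        for value, (mu, sigma) in zip(reading, baseline)
    )


def is_warning(reading, baseline, threshold=3.0):
    """Flag a reading whose score exceeds the (assumed) threshold."""
    return anomaly_score(reading, baseline) > threshold


baseline = fit_baseline(NORMAL_RUNS)
print(is_warning((70.3, 2.0, 120.0), baseline))  # close to normal running
print(is_warning((78.0, 3.5, 110.0), baseline))  # far from normal running
```

A production system would instead be trained on the recorded breakages as well as the normal runs, and would look at how readings evolve over time rather than at single snapshots.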
Data scientists are part artist and part engineer. They need a toolbox of techniques, skills, processes and abilities from which to construct novel solutions. They also need to be able to create a user interface that turns their abstract findings into something the users of the system can understand, which means having the skills to create elegant visualisations that turn raw data into information. And they need to be able to communicate well with people. There is little use in creating a superb analytical process if you can’t communicate how and why it works to the board members.
And then there is the curiosity. Duncan Ross, director of data sciences at Teradata, characterised data scientists well: “The first and most important trait is curiosity. Insane curiosity. In many walks of life evolution selects against the kind of person who decides to find out what happens 'if I push that button'. Data Science selects for it.”
So, what are the general characteristics of a DS?
They include: insatiable curiosity (see above), interdisciplinary interests, excellent communication skills and excellent analytical capabilities. Data scientists also need a good working knowledge of machine learning techniques, data mining, statistics, maths, algorithm development, code development, data visualisation and multi-dimensional database design and implementation.
Specific skills include the technologies to handle Big Data: NoSQL databases, Hadoop and related technologies, and MapReduce and its implementation on differing software platforms. Data scientists also have an intimate knowledge of languages such as SQL, MDX and R, and of functional and object-oriented languages such as Erlang and Java.
Data scientists will be required wherever large sets of data need to be analysed. This is true in the scientific world, of course, but that is where the title is somewhat misleading: they are also needed in commercial organisations, in bodies such as the NHS, in government departments, in defence and so on.
So where are all the data scientists going to come from? We’ve been "doing" data science at the School of Computing at the University of Dundee where I am chair of analytics, working with sets of Big Data as diverse as the output from mass spectrometers, image processing, web logs, data collected by games companies and so on.
This year, to run in parallel with our existing part-time Masters in BI, we are introducing a part-time Masters in Data Science. Most of the course is remote study because it is specifically designed for people already in employment in the database/analytical world who want to move into data science.
Fashions come and fashions go, but data scientists (whatever they may be called in the future) will endure. They will endure for the simple reasons that data is complex, the patterns within it are valuable, and spotting the patterns is difficult and requires an unusual mix of skills. ®
Mark Whitehorn holds the chair of analytics at the University of Dundee. His role involves working on data output from mass spectrometers: two-dimensional graphical traces of three-dimensional peaks, which must be detected and their volumes calculated. The trick isn’t to do the sums; it’s to do them rapidly because another 8Gbyte output file is always coming.