Big Data's big issue: Where are all the data scientists coming from?

This personnel gap isn't just a job-title change

Analysis: Plug “data scientist” into Google and it is clear that the job title has finally come of age and that, suddenly, there is a huge skills shortage.

An oft-quoted source about this shortage is a McKinsey Global Institute study, which predicts a talent gap of 140,000 to 190,000 people by 2018 in the US alone. I am always sceptical of IT projections more than 18 months ahead (let alone six years) but I am convinced there is currently a huge skills shortage that is not going away in the next 17 months and 29 days.

So, what is a data scientist? My favourite description comes from Twitter: “Yeah, so I'm actually a data scientist. I just do this barista thing in between gigs.” More cynically: “A data scientist is just an analyst who lives in California.”

Possibly more accurate is that a data scientist (DS) is “a better software engineer than any statistician and a better statistician than any software engineer”. In other words, an important part of the job is to be able to design novel analytical algorithms for specific sets of data and then be able to implement that algorithm in the appropriate computer language.

Data scientists excel at analysing data, particularly large amounts of data that do not fit easily into tabular structures: so-called “Big Data”.

For example, you should be able to point a data scientist at a web log and say: “Find the different patterns of behaviour in our users.” Or think about oil rigs for a moment. Breaking a drill bit during DIY work is irritating; in the middle of the North Sea it is annoying and very, very expensive. But if you collect enough sensor data (such as temperature, vibrations and RPM) you eventually have data for both normal running and breakages. You then point a data scientist at the data and say: “Build a system that predicts breakages before they happen.”
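As a rough illustration of the drill-bit idea, here is a minimal Python sketch. It assumes each sensor reading is a dictionary of channel values (the names temperature, vibration and rpm are illustrative, as are the function names fit_baseline and breakage_risk), learns what "normal running" looks like, and flags any reading that strays too far from it. A real system would almost certainly use a trained model and far more data; this simple z-score check only shows the shape of the problem.

```python
from statistics import mean, stdev


def fit_baseline(normal_readings):
    """Learn a per-channel (mean, standard deviation) from normal-running data.

    `normal_readings` is a list of dicts, one per reading, keyed by channel
    name. The channel names are hypothetical and not taken from any real rig.
    """
    channels = normal_readings[0].keys()
    baseline = {}
    for channel in channels:
        values = [reading[channel] for reading in normal_readings]
        baseline[channel] = (mean(values), stdev(values))
    return baseline


def breakage_risk(reading, baseline, threshold=3.0):
    """Return True if any channel is more than `threshold` standard
    deviations away from its normal-running mean."""
    return any(
        abs(reading[channel] - mu) > threshold * sd
        for channel, (mu, sd) in baseline.items()
    )


# Made-up readings standing in for data collected during normal running.
normal = [
    {"temperature": 70.0, "vibration": 1.00, "rpm": 120.0},
    {"temperature": 71.0, "vibration": 1.10, "rpm": 121.0},
    {"temperature": 69.0, "vibration": 0.90, "rpm": 119.0},
    {"temperature": 70.5, "vibration": 1.05, "rpm": 120.5},
    {"temperature": 69.5, "vibration": 0.95, "rpm": 119.5},
]

baseline = fit_baseline(normal)
suspect = {"temperature": 70.0, "vibration": 2.5, "rpm": 120.0}
print(breakage_risk(suspect, baseline))
```

The threshold of three standard deviations is an arbitrary starting point; choosing it well, and deciding which channels actually matter, is exactly the part that needs the data scientist.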

Data scientists are part artist and part engineer. They need a toolbox of techniques, skills, processes and abilities from which to construct novel solutions. They also need to create a user interface that turns their abstract findings into something the users of the system can understand, which means having the skills to create elegant visualisations that turn raw data into information. And they need to be able to communicate well with people. There is little use in creating a superb analytical process if you can’t communicate how and why it works to the board members.

And then there is the curiosity. Duncan Ross, director of data sciences at Teradata, characterised data scientists well: “The first and most important trait is curiosity. Insane curiosity. In many walks of life evolution selects against the kind of person who decides to find out what happens 'if I push that button'. Data Science selects for it.”

So, what are the general characteristics of a DS?

They include: insatiable curiosity (see above), interdisciplinary interests, excellent communication skills and excellent analytical capabilities. Data scientists also need a good working knowledge of machine learning techniques, data mining, statistics, maths, algorithm development, code development, data visualisation and multi-dimensional database design and implementation.

Specific skills include the technologies to handle Big Data: NoSQL databases, Hadoop and related technologies, and MapReduce and its implementations on differing software platforms. Data scientists also have an intimate knowledge of languages such as SQL, MDX and R, and of functional and object-oriented languages such as Erlang and Java.

Data scientists will be required wherever large sets of data need to be analysed. This is true in the scientific world of course, but that is where the title is somewhat misleading because they are also needed in commercial organisations, in organisations like the NHS, government departments, defence and so on.

So where are all the data scientists going to come from? We’ve been "doing" data science at the School of Computing at the University of Dundee, where I am chair of analytics, working with sets of Big Data as diverse as the output from mass spectrometers, image processing, web logs, data collected by games companies and so on.

This year, to run in parallel with our existing part-time Masters in BI, we are introducing a part-time Masters in Data Science. Most of the course is remote study because it is specifically designed for people already in employment in the database/analytical world who want to move into data science.

Fashions come and fashions go, but data scientists (whatever they may be called in the future) will endure. They will endure for the simple reasons that data is complex, the patterns within it are valuable, and spotting the patterns is difficult and requires an unusual mix of skills. ®

Mark Whitehorn holds the chair of analytics at the University of Dundee. His role involves working on data output from mass spectrometers: two-dimensional graphical traces of three-dimensional peaks that must be detected and their volumes calculated. The trick isn’t to do the sums; it’s to do them rapidly because another 8Gbyte output file is always coming.
