Original URL: http://www.theregister.co.uk/2013/02/20/big_data_careers/

I used to be an Oracle DBA ... but now I'm a Big Data guru

Dominic Connor shows you how to jump on the bandwagon

By Dominic Connor

Posted in CIO, 20th February 2013 09:36 GMT

As the demand for Oracle skills fades along with VB and as even Java loses its shine, the smart developer is looking at what will pay the bills for the next decade.

As an IT pro you have to bet your career every few years, and Big Data is too obvious an opportunity to pass up. The problem is that it's not a single product you can cross-train into, but a new set of methods as well as tools.

You can’t handle the (Big) truth

Apparently I'm not the most valuable writer at The Reg; our wise leadership has crossed advert sales with click data to work that out. I'm too petty to tell you who is number one, but the point is that this is actionable data: the rational response is a positive feedback loop that pushes the parts of your business that make money and fixes or junks the parts that don't.

Talking to Duncan Ross, who evangelises Big Data for Teradata, I discover that the first skill you need to develop is the mindset of working out what would be good to know so that you can use it. Being interesting is necessary but not sufficient.

This illustrates that, like most industry buzzwords, “Big” is a cynical spin on what should more often be called “awkward”. A billion rows of normalised data in a well-designed schema might be large, but it's hard to see a big pay packet coming from running an aggregate function like Average on it. To be a lucrative skill there has to be some pain that you can overcome for your patient employer.

Crossing databases can be deliciously hard, and that's why I commend it to you as a skill in which to invest. Easy skills may shine brightly for a while, but market forces (people reading this article and the job ads) will drag your price down. This is far more than getting Oracle to coexist with SQL Server using toolsets like DTS, because an increasingly large percentage of your most useful data now lives in clouds - BlackLine for accounts, say, or Salesforce for (surprise) sales - each of which can be queried with SQL, as can your in-house HR DB. Be clear that they each speak a different flavour of SQL, with entities that are not in any kind of mutually consistent form. Also note that inter-cloud interoperability is still at a laughably primitive stage, so that's your opening. Part of the joy of integration is that the number of combinations multiplies with each system you add to the mix - meaning there is lots of work to be done for which they will pay you.
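To make that concrete, here is a minimal sketch of the sort of application-side join you end up writing when the two ends won't talk to each other directly. Everything in it - the JDBC URLs, credentials, table and column names - is made up for illustration; the point is that the join happens in your code, not in either database.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    public class CrossDbJoin {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection strings - substitute your own drivers, hosts and credentials.
            try (Connection crm = DriverManager.getConnection(
                     "jdbc:sqlserver://crm-host;databaseName=sales", "user", "pass");
                 Connection hr = DriverManager.getConnection(
                     "jdbc:oracle:thin:@hr-host:1521:HRDB", "user", "pass")) {

                // Pull account owners from the CRM side into a map keyed on email address.
                Map<String, String> accountByEmail = new HashMap<>();
                try (Statement st = crm.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT owner_email, account_name FROM accounts")) {
                    while (rs.next()) {
                        accountByEmail.put(rs.getString(1), rs.getString(2));
                    }
                }

                // Walk the HR employee table and do the join in the application,
                // because neither database can see the other's tables.
                try (Statement st = hr.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT email, department FROM employees")) {
                    while (rs.next()) {
                        String account = accountByEmail.get(rs.getString("email"));
                        if (account != null) {
                            System.out.println(rs.getString("department") + " -> " + account);
                        }
                    }
                }
            }
        }
    }

It works, after a fashion, right up until one side renames a column or the volume outgrows a HashMap - which is exactly the pain you're being paid to manage.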

We've been crossing databases together for decades, often with extra indexes, or massive slow batch imports of data followed by an equally slow (and often less than reliable) batch download. That's not only clunky, it's rapidly going downhill as a career option, and because it's just you and me here, we can be honest about this.

YOU know how the Fat Controller (yes, it's called that at one bank) sucks in data and no one else does, at least not well enough that they'd ever dare lose you. That's good until the day some jerk whispers “Hadoop” into the ear of your CIO. Then what you have are non-portable skills that can only be used in one firm, which is phasing them out. This is so bad that you must be that jerk. It's better to be leading the charge than to be ridden over, and even if the project goes titsup, Hadoop is a pretty good word to have on your CV.

Hadoop's open-sourciness and use of commodity hardware are both good and bad for your CV upgrade, because they mean the barrier to entry is a bit lower than it would be if learning the skill required access to a huge pile of expensive hardware and software. The clock has therefore already started ticking quietly while the other seven million people who read this article pile in... Nevertheless, the value of this knowledge won't degrade anything like as fast as your expertise in a proprietary system. As IT pros we're always walking up the down escalator, but it's best to pick one that's moving slower and has fewer people in the way.

Patch that bug in your skillset

MapReduce presents the same opportunities and hazards. Although powerful, it still has the feel of an early version of something that will one day be much smoother, exposing developers to a sometimes uncomfortable mix of high- and low-level concepts in networking and data science. But it's the main game in town. Your VB/Java/SQL skills will only partly prepare you for the learning curve; the main challenge will be to start thinking about parallel execution, something most of us have been able to avoid for too long. You will need to fix that bug in your skills even if you don't go into Big Data - look how many cores and threads even a tablet has these days to see why.
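If you've never written one, the canonical word-count job is the quickest way to get a feel for the split between map and reduce. The sketch below uses the standard Hadoop Java API; the class names are the usual textbook ones rather than anything official, and you'd package it as a jar and hand it to the cluster with input and output paths as arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split handed to this mapper.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts for each word; the framework handles the shuffle and sort.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The interesting part is what you don't write: the framework splits the input, ships the mapper to wherever the data lives, and sorts and shuffles the intermediate pairs before the reducers see them. Getting used to that inversion of control is the real learning curve.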

If you’re already seeing that Oracle has more yesterdays than tomorrows, then it should be some relief that NoSQL is a very common buzzword in the brave new world and that the main problem you’ll face is that it has a lot fewer features than Oracle or MSSQL.

That’s a common issue here. The tools are not as smooth in Big Data, partly because it’s newish, but mainly because function takes precedence over form in open source projects. These factors also mean that you’re going to spend more time at lower levels of the stack than you’re used to as a DB or Client/Server developer.
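To see what “fewer features” means in practice, here is roughly what a single write and read look like against HBase, one of the usual Hadoop-family NoSQL stores. This is only a sketch using the HBase Java client, and it assumes a running cluster with a table called customers and a column family called d already created - both names invented for the example. Notice what isn't there: no WHERE clause, no join, no optimiser.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyValueTaster {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table customers = conn.getTable(TableName.valueOf("customers"))) {

                // Write one cell: row key, column family, qualifier, value - no schema beyond that.
                Put put = new Put(Bytes.toBytes("cust-00042"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes("A. Customer"));
                customers.put(put);

                // Read it back by key. Anything fancier than get-by-key is your problem.
                Result row = customers.get(new Get(Bytes.toBytes("cust-00042")));
                System.out.println(Bytes.toString(
                        row.getValue(Bytes.toBytes("d"), Bytes.toBytes("name"))));
            }
        }
    }

Anything more clever - an aggregate, a secondary index, a join against that Salesforce extract - is yours to build, which is precisely why the skills command a premium.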

Do you know the latency times for the disks on your server? Have you coded to the (really very hard) InfiniBand interface? The odds are that you're going to have to start caring about network throughput a lot more than you have been, since the fragmentation of clusters on a disk is as nothing when you're pulling data from an arbitrary number of servers. Odds are you won't have to do much of the hands-on work at this level, but you can't ignore it any more.
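You don't need exotic kit to start building the habit. A few lines like the ones below - a made-up micro-benchmark, not a serious measurement tool - will tell you roughly what sequential read throughput your box manages, and a second run will show you how much the OS page cache flatters the numbers.

    import java.io.FileInputStream;
    import java.io.InputStream;

    public class ThroughputCheck {
        public static void main(String[] args) throws Exception {
            byte[] buffer = new byte[1 << 20];   // 1MB read buffer
            long bytes = 0;
            long start = System.nanoTime();
            // Pass the path of any large local file on the command line.
            try (InputStream in = new FileInputStream(args[0])) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    bytes += n;
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("Read %.1f MB in %.2f s (%.1f MB/s)%n",
                    bytes / 1e6, seconds, bytes / 1e6 / seconds);
        }
    }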

Big Data, small minds

If we're optimising for buzzwords here, then you also need to look at web crawling, because a lot of what you want to know about your customers is in semi-structured text form on social media - as Raytheon has recently disclosed, not always intentionally. Starting with wget and working up to handling Ajax-driven pages, you can suck down consumer data. Alternatively, the system you use may have such an awful API that shredding its web pages is actually easier.
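At the crude end of the scale, fetching a page and shredding it for links takes only a few lines of Java. The URL and user-agent string below are placeholders, and a regex over raw HTML is deliberately primitive - it won't execute JavaScript, so anything Ajax-heavy needs a headless browser or the site's API instead - but it shows the shape of the job.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PageFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://example.com/");   // placeholder URL
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "big-data-demo/0.1");

            // Read the whole response body into memory.
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }

            // Crude shredding: pull out anything that looks like an outbound link.
            Matcher m = Pattern.compile("href=\"(https?://[^\"]+)\"").matcher(body);
            while (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }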

Since this is going to lead to a more than slightly complex set of technologies, you might want to look at Teradata’s Aster to keep as much as possible working with SQL, lest you spread yourself too thinly writing lots of client code and trying to map XML to relational.

Visible productivity

This is all a waste of time unless it looks good. If you're a full-time report writer already, you know the form/substance equation, but I'll spell it out for everyone else: you're producing data for decision makers, who like the full range of graphs so they can cut and paste them into PowerPoint. By all means practise your skills by dumping the data into Excel, but it's not so good at illustrating data with lots of variables, and in any case everyone's seen Excel graphs. You ought to look at Matlab, R and Mathematica, which do a better job of heat maps and more exotic data visualisations.

So here’s what you do...

Sticking my neck out a bit (or more accurately, yours), I suggest taking a good look at the Progress DataDirect tools, which cut down the effort of bringing disparate sources together: although you want to pick a hard task to impress your boss, you still have to deliver in a decent timescale. I also suspect they will become an increasingly frequent buzzword in job ads, since they can be programmed through much the same sort of ODBC/JDBC interface you probably already know.

It's a bad idea to move too far into a toolset that is too ready-cooked or easy to use, because it will both quickly get crowded and hit limitations on what can be done. So Pig (don't you love open source names) is also a good bet: it's already appearing in a lot of job ads, it's very extensible, and yet it's easy enough to get started with that you will be able to show some visible productivity early.

Oddly enough, the data mining we see from Facebook is really very primitive, implying that Mark Zuckerberg should have finished his degree: it is very much second-year rules-based “Introduction to AI” stuff, very deterministic, with little in the way of Bayesian inference - or even Markov chains. But judging by some of my candidates who've turned down jobs there, the quality of privacy invasion we experience will go up over the next few years as they start doing modern (1980s) data science.

Finally, a word of warning: Your self-taught data-mining will throw up lots of correlations, but please remember that correlation is not causation. You can torture numbers until they say anything you want: Audi drivers may be worse than most others, but it is not the car that makes them so - it just attracts arseholes. The correlation without causality point is so important that you should make sure you drop it into conversation at your Big Data interview. ®

Dominic Connor is a City headhunter