I used to be an Oracle DBA ... but now I'm a Big Data guru
Dominic Connor shows you how to jump on the bandwagon
Patch that bug in your skillset
MapReduce has the same opportunities and hazards. Although powerful, it still has the feel of an early version of something that will one day be much smoother, and it exposes developers to a sometimes uncomfortable mix of high and low-level concepts in networking and data science. But that’s the main game in town. Your VB/Java/SQL skills will only partly prepare you for the learning curve, and the main challenge will be to start thinking about parallel execution, something most of us have largely been able to avoid for too long. You will need to fix that bug in your skills, even if you don’t go into Big Data. Look at how many cores/threads even a tablet has these days to see why.
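If the map/shuffle/reduce pattern is new to you, here is a minimal sketch in plain Python, not Hadoop itself. The function names and sample text are made up for illustration, and a thread pool stands in for the cluster nodes. The thing to notice is that each map call sees only its own chunk, and each key can be reduced without reference to any other key.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk; it needs no knowledge of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group every emitted value by key, as a framework would between phases."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key; each key is independent of the rest."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Threads here are only a stand-in for separate machines (an assumption
    # of this sketch); the map calls share no state, so they can run anywhere.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input, pre-split into chunks as a distributed file system would.
chunks = ["big data big hype", "big clusters", "data data everywhere"]
counts = word_count(chunks)
print(counts)
```

The habit to build is writing the map and reduce steps so they never depend on shared state. Once that holds, moving from four threads to four hundred servers is mostly the framework's problem rather than yours.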
If you’re already seeing that Oracle has more yesterdays than tomorrows, then it should be some relief that NoSQL is a very common buzzword in the brave new world and that the main problem you’ll face is that it has a lot fewer features than Oracle or MSSQL.
That’s a common issue here. The tools are not as smooth in Big Data, partly because it’s newish, but mainly because function takes precedence over form in open source projects. These factors also mean that you’re going to spend more time at lower levels of the stack than you’re used to as a DB or Client/Server developer.
Do you know the latency times for the disks on your server? Have you coded to the (really very hard) InfiniBand interface? Because the odds are that you’re going to have to start caring about network throughput a lot more than you have been, since the fragmentation of clusters on a disk is as nothing when you’re pulling data from an arbitrary number of servers. Odds are you won’t have to do much of the hands-on work at this level, but you can’t ignore it any more.
Big Data, small minds
If we’re optimising for buzzwords here, then you also need to look at web crawling because a lot of what you want to know about your customers is in semi-structured text form on social media - as Raytheon has recently disclosed, not always intentionally. Starting with wget and working through Ajax development you can suck down consumer data. Alternatively, the system you use may have such an awful API that shredding the web pages is actually easier.
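To give a flavour of what "shredding" a page involves, here is a small sketch using only Python's standard `html.parser` module. The page markup, the `comment` class name and the `CommentShredder` class are all invented for illustration. A real site will have its own structure, and you would fetch the page first (with wget or similar) rather than hard-code it.

```python
from html.parser import HTMLParser


class CommentShredder(HTMLParser):
    """Collect the text of every <p> whose class list includes 'comment'."""

    def __init__(self):
        super().__init__()
        self._in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "comment" in classes:
            self._in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_comment = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._in_comment:
            self.comments[-1] += data


# Hypothetical page: navigation and adverts are noise, comments are the data.
page = """
<html><body>
  <div class="nav">Home | Login</div>
  <p class="comment">Love the new <b>phone</b>, hate the battery.</p>
  <p class="ad">Buy now!</p>
  <p class="comment user">Support never replied.</p>
</body></html>
"""

shredder = CommentShredder()
shredder.feed(page)
comments = [text.strip() for text in shredder.comments]
print(comments)
```

This sort of code is brittle by nature: it breaks whenever the site changes its markup, which is worth remembering when you estimate the maintenance effort.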
Since this is going to lead to a more than slightly complex set of technologies, you might want to look at Teradata’s Aster to keep as much as possible working with SQL, lest you spread yourself too thinly writing lots of client code and trying to map XML to relational.
This is all a waste of time unless it looks good. If you’re a full-time report writer already, you know the form/substance equation, but I’ll spell it out for everyone else: you’re producing data for decision makers, who like the full range of graphs so they can cut & paste them into PowerPoint. By all means practise your skills by dumping the data into Excel, but Excel is not so good at illustrating data where there are lots of variables, and in any case everyone’s seen Excel graphs. You ought to look at MATLAB, R and Mathematica, which do a better job of heat maps and more exotic data visualisations.
So here’s what you do...
Sticking my neck out a bit (or more accurately, yours) I suggest taking a good look at the Progress DataDirect tools, which cut down the effort of bringing disparate sources together, since although you want to pick a hard task to impress your boss, you have to deliver in a decent timescale. Also, I suspect they will be an increasingly frequent buzzword in job ads since they can be programmed through much the same sort of ODBC/JDBC interface you probably already know.
It’s a bad idea to move too much into a toolset that is too ready-cooked or easy-to-use because it will both quickly get crowded and hit limitations on what can be done. So Pig (don’t you love open source names) is also a good bet, already appearing in a lot of job ads and very extensible but easy enough to get started with that you will be able to show some visible productivity early.
Oddly enough, the data mining we see from Facebook is really very primitive, implying that Mark Zuckerberg should have finished his degree, since it is very much second-year rules-based “Introduction to AI” stuff, very deterministic, little in the way of Bayesian inference - or even Markov. But talking to some of my candidates who’ve turned down jobs there, the quality of privacy invasion we will experience will go up over the next few years as they start doing modern (1980s) data science.
Finally, a word of warning: Your self-taught data-mining will throw up lots of correlations, but please remember that correlation is not causation. You can torture numbers until they say anything you want: Audi drivers may be worse than most others, but it is not the car that makes them so - it just attracts arseholes. The correlation without causality point is so important that you should make sure you drop it into conversation at your Big Data interview. ®
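For anyone who wants to see the correlation trap rather than just quote it, here is a toy simulation on entirely synthetic data. The variable names and the model are assumptions made up for this sketch, not real driver statistics. A hidden trait drives both the choice of car and the quality of driving, while the car has no effect on driving at all. The two still come out correlated, at around 0.5 in expectation under this model, until the hidden trait is accounted for.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so repeated runs agree
n = 5000

# Hidden confounder: a personality trait we never observe directly.
temperament = [rng.gauss(0, 1) for _ in range(n)]

# Both observed variables depend on the trait plus independent noise.
# Note that flashy_car does NOT appear in the formula for bad_driving.
flashy_car = [t + rng.gauss(0, 1) for t in temperament]
bad_driving = [t + rng.gauss(0, 1) for t in temperament]

r_raw = pearson(flashy_car, bad_driving)

# Remove the trait's contribution. Subtracting it directly works only because
# this toy model gives it a coefficient of exactly 1; with real data you
# would have to estimate that, for example by regression.
car_resid = [c - t for c, t in zip(flashy_car, temperament)]
driving_resid = [d - t for d, t in zip(bad_driving, temperament)]
r_controlled = pearson(car_resid, driving_resid)

print(f"car vs driving:           r = {r_raw:.2f}")
print(f"after removing the trait: r = {r_controlled:.2f}")
```

In real data you rarely get to observe the confounder this neatly, which is exactly why the warning matters.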
Dominic Connor is a City headhunter