I used to be an Oracle DBA ... but now I'm a Big Data guru

Dominic Connor shows you how to jump on the bandwagon


Patch that bug in your skillset

MapReduce has the same opportunities and hazards and, although powerful, still has the feel of an early version of something that will one day be much smoother, exposing developers to a sometimes uncomfortable mix of high- and low-level concepts in networking and data science. But it’s the main game in town. Your VB/Java/SQL skills will only partly prepare you for the learning curve, and the main challenge will be to start thinking about parallel execution, something most of us have largely been able to avoid for too long. You will need to fix that bug in your skills, even if you don’t go into Big Data. Look at how many cores/threads even a tablet has these days to see why.
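To get a feel for the shift in thinking, here is a minimal sketch of the map/shuffle/reduce pattern in plain Python. It is not Hadoop code: the function names, the sample chunks and the use of a thread pool to stand in for cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single total."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently of the others, so the map calls
    # can run on any worker in any order; only the shuffle needs them all.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input: on a real cluster each chunk would sit on a different node.
chunks = ["big data big hype", "big cluster small data"]
counts = word_count(chunks)
print(counts)
```

The point is less the word count than the discipline: the map step may not rely on shared state or on the order in which chunks are processed, which is exactly the habit most client/server code never had to form.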

If you’re already seeing that Oracle has more yesterdays than tomorrows, then it should be some relief that NoSQL is a very common buzzword in the brave new world and that the main problem you’ll face is that it has a lot fewer features than Oracle or MSSQL.

That’s a common issue here. The tools are not as smooth in Big Data, partly because it’s newish, but mainly because function takes precedence over form in open source projects. These factors also mean that you’re going to spend more time at lower levels of the stack than you’re used to as a DB or Client/Server developer.

Do you know the latency of the disks on your server? Have you coded to the (really very hard) Infiniband interface? Because the odds are that you’re going to have to start caring about network throughput issues a lot more than you have been, since the fragmentation of clusters on a disk is as nothing when you’re pulling data from an arbitrary number of servers. Odds are you won’t have to do so much of the hands-on at this level, but you can’t ignore it any more.

Big Data, small minds

If we’re optimising for buzzwords here, then you also need to look at web crawling because a lot of what you want to know about your customers is in semi-structured text form on social media - as Raytheon has recently disclosed, not always intentionally. Starting with wget and working through Ajax development you can suck down consumer data. Alternatively, the system you use may have such an awful API that shredding the web pages is actually easier.
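As a rough idea of what "shredding" a page involves once wget has fetched it, the sketch below pulls comment text out of HTML using only Python's standard library. The page content, the `comment` class name and the `CommentScraper` class are invented for illustration; a real site will have its own, usually messier, markup.

```python
from html.parser import HTMLParser


class CommentScraper(HTMLParser):
    """Collect the text of every <p class="comment"> element."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "p" and ("class", "comment") in attrs:
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_comment = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the comment
        if self.in_comment:
            self.comments[-1] += data


# Hypothetical page; in practice this would be the file wget saved.
page = """
<html><body>
<p class="comment">Love the new phone</p>
<p class="ad">Buy now</p>
<p class="comment">Battery life is <b>terrible</b></p>
</body></html>
"""

scraper = CommentScraper()
scraper.feed(page)
comments = [c.strip() for c in scraper.comments]
print(comments)
```

Scrapers like this break whenever the site changes its layout, which is why a decent API, where one exists, is normally the better first choice.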

Since this is going to lead to a more than slightly complex set of technologies, you might want to look at Teradata’s Aster to keep as much as possible working with SQL, lest you spread yourself too thinly writing lots of client code and trying to map XML to relational.
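To show the kind of client code that mapping XML to relational implies, here is a small sketch using Python's standard library and an in-memory SQLite database. The XML layout, table names and columns are hypothetical, and this illustrates the hand-rolled chore rather than anything Aster itself does.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical feed: customers with nested orders
xml_doc = """
<customers>
  <customer id="1" name="Alice">
    <order sku="A100" qty="2"/>
    <order sku="B200" qty="1"/>
  </customer>
  <customer id="2" name="Bob">
    <order sku="A100" qty="5"/>
  </customer>
</customers>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, sku TEXT, qty INTEGER)")

# The nesting in the XML becomes a foreign key in the relational model
root = ET.fromstring(xml_doc)
for cust in root.findall("customer"):
    cust_id = int(cust.get("id"))
    conn.execute("INSERT INTO customer VALUES (?, ?)", (cust_id, cust.get("name")))
    for order in cust.findall("order"):
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (cust_id, order.get("sku"), int(order.get("qty"))),
        )

# Once the data is relational, plain SQL takes over again
totals = conn.execute(
    "SELECT c.name, SUM(o.qty) FROM customer c "
    "JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(totals)
```

Every new source format means another loader like this one, which is the spreading-too-thinly problem in miniature.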

Visible productivity

This is all a waste of time unless it looks good. If you’re a full-time report writer already, you know the form/substance equation, but I’ll spell it out for everyone else: you’re producing data for decision makers, who like the full range of graphs so they can cut & paste them into PowerPoint. By all means practise your skills by dumping the data into Excel, but it’s not so good at illustrating data where there are lots of variables, and in any case everyone’s seen Excel graphs. You ought to look at Matlab, R and Mathematica, which do a better job of heat maps and more exotic data visualisations.

So here’s what you do...

Sticking my neck out a bit (or more accurately, yours) I suggest taking a good look at the Progress DataDirect tools, which cut down the effort of bringing disparate sources together, since although you want to pick a hard task to impress your boss, you have to deliver in a decent timescale. Also, I suspect they will be an increasingly frequent buzzword in job ads, since they can be programmed through much the same sort of ODBC/JDBC interface you probably already know.

It’s a bad idea to move too much into a toolset that is too ready-cooked or easy-to-use, because it will quickly get crowded and will also hit limits on what can be done. So Pig (don’t you love open source names?) is also a good bet: it already appears in a lot of job ads and is very extensible, but it is easy enough to get started with that you will be able to show some visible productivity early.

Oddly enough, the data mining we see from Facebook is really very primitive, implying that Mark Zuckerberg should have finished his degree, since it is very much second-year rules-based “Introduction to AI” stuff, very deterministic, little in the way of Bayesian inference - or even Markov. But talking to some of my candidates who’ve turned down jobs there, the quality of privacy invasion we will experience will go up over the next few years as they start doing modern (1980s) data science.

Finally, a word of warning: Your self-taught data-mining will throw up lots of correlations, but please remember that correlation is not causation. You can torture numbers until they say anything you want: Audi drivers may be worse than most others, but it is not the car that makes them so - it just attracts arseholes. The correlation without causality point is so important that you should make sure you drop it into conversation at your Big Data interview. ®

Dominic Connor is a City headhunter
