
I used to be an Oracle DBA ... but now I'm a Big Data guru

Dominic Connor shows you how to jump on the bandwagon

Patch that bug in your skillset

MapReduce has the same opportunities and hazards. Although powerful, it still has the feel of an early version of something that will one day be much smoother, and it exposes developers to a sometimes uncomfortable mix of high and low level concepts in networking and data science. But that’s the main game in town. Your VB/Java/SQL skills will only partly prepare you for the learning curve, and the main challenge will be to start thinking about parallel execution, something most of us have largely been able to avoid for too long. You will need to fix that bug in your skills, even if you don’t go into Big Data. Look how many cores/threads even a tablet has these days to see why.
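To make the shift in thinking concrete, here is a minimal sketch of the map/reduce pattern in plain Python. It is not Hadoop code: the function names and the sample text are invented for illustration, and a thread pool stands in for the cluster nodes. The point is only that each chunk is counted without reference to any other chunk, and that the partial results are merged afterwards.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of all others."""
    return Counter(chunk.lower().split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def word_count(chunks, workers=4):
    # Assumption: threads stand in for cluster nodes. On a real cluster
    # each chunk would live on a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


# Hypothetical input, already split into independent chunks.
chunks = ["big data big hype", "data beats hype", "big cluster"]
totals = word_count(chunks)
print(totals.most_common())
```

Because the merge step is associative, the partial counts can be combined in any order, which is what lets the work be spread over as many workers as you can find.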

If you’re already seeing that Oracle has more yesterdays than tomorrows, then it should be some relief that NoSQL is a very common buzzword in the brave new world and that the main problem you’ll face is that it has a lot fewer features than Oracle or MSSQL.

That’s a common issue here. The tools are not as smooth in Big Data, partly because it’s newish, but mainly because function takes precedence over form in open source projects. These factors also mean that you’re going to spend more time at lower levels of the stack than you’re used to as a DB or Client/Server developer.

Do you know the latency times for the disks on your server? Have you coded to the (really very hard) InfiniBand interface? Because the odds are that you’re going to have to start caring about network throughput issues a lot more than you have been, since the fragmentation of clusters on a disk is as nothing when you’re pulling data from an arbitrary number of servers. Odds are you won’t have to do so much of the hands-on at this level, but you can’t ignore it any more.

Big Data, small minds

If we’re optimising for buzzwords here, then you also need to look at web crawling because a lot of what you want to know about your customers is in semi-structured text form on social media - as Raytheon has recently disclosed, not always intentionally. Starting with wget and working through Ajax development you can suck down consumer data. Alternatively, the system you use may have such an awful API that shredding the web pages is actually easier.
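As a rough illustration of what shredding a web page involves, the sketch below pulls the text out of selected elements using only Python's standard `html.parser` module. The page snippet, the `post` class name and the `PostExtractor` class are all made up for the example; a real site will have its own markup, and fetching the page in the first place is left out.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text inside every <p class="post"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_post = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "p" and ("class", "post") in attrs:
            self._in_post = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_post:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_post:
            self.posts.append("".join(self._buffer).strip())
            self._in_post = False


# Hypothetical page content; the class names are assumptions.
PAGE = """
<html><body>
<p class="post">Love my new phone!</p>
<p class="ad">Buy more stuff</p>
<p class="post">Battery life is <b>terrible</b>.</p>
</body></html>
"""

extractor = PostExtractor()
extractor.feed(PAGE)
posts = extractor.posts
print(posts)
```

The output is still free text, which is where the semi-structured part of the problem starts.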

Since this is going to lead to a more than slightly complex set of technologies, you might want to look at Teradata’s Aster to keep as much as possible working with SQL, lest you spread yourself too thinly writing lots of client code and trying to map XML to relational.

Visible productivity

This is all a waste of time unless it looks good. If you’re a full-time report writer already, you know the form/substance equation, but I’ll spell it out for everyone else: you’re producing data for decision makers, who like the full range of graphs so they can cut & paste them into PowerPoint. By all means practise your skills by dumping the data into Excel, but Excel is not so good at illustrating data where there are lots of variables, and in any case everyone’s seen Excel graphs. You ought to look at Matlab, R and Mathematica, which do a better job of heat maps and more exotic data visualisations.

So here’s what you do...

Sticking my neck out a bit (or more accurately, yours) I suggest taking a good look at the Progress DataDirect tools, which cut down the effort of bringing disparate sources together, since although you want to pick a hard task to impress your boss, you have to deliver in a decent timescale. Also, I suspect they will be an increasingly frequent buzzword in job ads since they can be programmed through much the same sort of ODBC/JDBC interface you probably already know.

It’s a bad idea to move too far into a toolset that is too ready-cooked or easy-to-use, because it will both quickly get crowded and hit limitations on what can be done. So Pig (don’t you love open source names?) is also a good bet: it already appears in a lot of job ads and is very extensible, but it is easy enough to get started with that you will be able to show some visible productivity early.

Oddly enough, the data mining we see from Facebook is really very primitive, implying that Mark Zuckerberg should have finished his degree, since it is very much second-year rules-based “Introduction to AI” stuff, very deterministic, little in the way of Bayesian inference - or even Markov. But talking to some of my candidates who’ve turned down jobs there, the quality of privacy invasion we will experience will go up over the next few years as they start doing modern (1980s) data science.

Finally, a word of warning: Your self-taught data-mining will throw up lots of correlations, but please remember that correlation is not causation. You can torture numbers until they say anything you want: Audi drivers may be worse than most others, but it is not the car that makes them so - it just attracts arseholes. The correlation without causality point is so important that you should make sure you drop it into conversation at your Big Data interview. ®
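For anyone who wants to see the point in numbers before dropping it into conversation, here is a small sketch with an entirely invented population. The driver counts, the incident rates and the two temperament labels are assumptions chosen so that, within each temperament group, the car has no effect at all on driving. Temperament alone drives both the choice of car and the incident rate.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def drivers(temperament, n_audi, n_other, incident_rate):
    """Rows of (temperament, drives_audi, had_incident).

    The incident rate is identical for both cars inside a group.
    """
    rows = []
    for drives_audi, n in ((1, n_audi), (0, n_other)):
        bad = round(n * incident_rate)
        rows += [(temperament, drives_audi, 1)] * bad
        rows += [(temperament, drives_audi, 0)] * (n - bad)
    return rows


# Invented numbers: aggressive drivers mostly pick the Audi, calm ones mostly don't.
population = drivers("aggressive", 80, 20, 0.5) + drivers("calm", 20, 80, 0.1)


def corr(rows):
    return pearson([r[1] for r in rows], [r[2] for r in rows])


overall = corr(population)
within = {t: corr([r for r in population if r[0] == t])
          for t in ("aggressive", "calm")}

print(f"whole population: {overall:.2f}")
for temperament, value in within.items():
    print(f"{temperament} drivers only: {value:.2f}")
```

Across the whole population the correlation between car and incidents comes out at about 0.26, yet inside each group it is zero by construction. A naive mining run would see only the first number.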

Dominic Connor is a City headhunter
