Cloudera sends in the auditors – for Hadoop

Giving enterprises what they want: auditing, backup, and rolling upgrades

Techies need tools to manage cranky Hadoop clusters, and business managers need to manage and report on access to data stored in Hadoop to appease cranky auditors. And so, as part of an update to its CDH4 stack on Tuesday at the Strata conference in San Francisco, Cloudera is previewing a new data visualization and auditing tool that adds this much-needed feature to its big data muncher. The update also includes better data archiving and tweaked Hadoop cluster management tools.

The new data visualization and auditing tool is called Cloudera Navigator 1.0, and it will control freak and document access to data stored in the Hadoop Distributed File System, in the HBase key-value store that rides on top of HDFS, and in the Hive data warehousing layer that overlays HDFS for ad-hoc querying.

Cloudera Navigator is a data-discovery tool, helping analysts figure out what data is being stored in a Hadoop cluster, what formats the data is stored in and where, and how the data got into the system in the first place.

First and foremost, however, it has auditing capabilities that keep track of who did what inside the system, much as other enterprise applications have been doing for many years now. Cloudera Navigator, explains Charles Zedlewski, vice president of products at the company, will verify which users and groups have access to what files and directories in a Hadoop cluster and allow audit tracking to be turned on for each kind of Hadoop service individually.

Cloudera Navigator also has a dashboard that auditors can query to see who has access to what data, and there is an export feature that can take all of the audit information and port it out so it can be sucked into Security Information and Event Management (SIEM) tools.
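To make the export idea a little more concrete, here is a minimal Python sketch of per-service audit records being filtered and written out as JSON lines, a format many SIEM tools can ingest. Everything in it is hypothetical: the AuditEvent fields, the sample events, and the export_events helper are assumptions for illustration only, not Cloudera Navigator's actual schema or export format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AuditEvent:
    # Hypothetical fields, not Navigator's real schema.
    timestamp: str
    user: str
    service: str   # e.g. "hdfs", "hbase", "hive"
    action: str
    resource: str
    allowed: bool


def export_events(events, enabled_services):
    """Return one JSON line per event from a service with auditing turned on."""
    lines = []
    for event in events:
        if event.service in enabled_services:
            lines.append(json.dumps(asdict(event), sort_keys=True))
    return lines


# Made-up sample events, purely for illustration.
events = [
    AuditEvent("2013-01-01T10:00:00Z", "alice", "hdfs", "open", "/data/claims.csv", True),
    AuditEvent("2013-01-01T10:01:00Z", "bob", "hive", "query", "default.customers", False),
    AuditEvent("2013-01-01T10:02:00Z", "carol", "hbase", "get", "orders", True),
]

# Auditing is switched on per service; in this example HBase auditing is off.
exported = export_events(events, enabled_services={"hdfs", "hive"})
for line in exported:
    print(line)
```

The per-service filter mirrors the point above that audit tracking can be turned on for each kind of Hadoop service individually; a real export would of course come from the tool itself rather than from hand-built records.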

The lack of such tools keeps auditors up at night, and we won't think too hard about how excited they get when they see Hadoop being brought under their watchful eyes.

But the fact remains that various regulations – Sarbanes-Oxley, HIPAA, PCI, Basel II, and so forth – have very strict rules about demonstrating that data is only available to those who are entitled to it. And that is why, says Zedlewski, healthcare, financial, and retail companies have been lining up to beta test Cloudera Navigator.

Cloudera Navigator doesn't mine the data itself, but rather the metadata that is created as information is poured into the Hadoop system. So you cannot do data discovery or auditing on information that is already in a Hadoop cluster, but you can do it for any new information you suck into it or spit out after munching it.

The data discovery side of the tool is important for ease of use as Hadoop clusters scale, too. "The very act of making Hadoop more of a self-service kind of program is more of a challenge on a petabyte-class system than on a terabyte system," says Zedlewski. You could do data discovery on the raw data in a small cluster, perhaps, but on a petabyte-scale Hadoop cluster with thousands of nodes, you might have 10,000 tables but the metadata only weighs in at a few gigabytes of capacity.

Cloudera has also cooked up a new feature called Enterprise BDR, which is short for backup and disaster recovery, that takes the replication features inherent in HDFS as well as HBase and the Hive metastore and coordinates and orchestrates them so you can do backup and recovery on a remote Hadoop cluster. Right now, says Zedlewski, companies have to do a lot of scripting themselves to take the asynchronous replication features in HDFS and Hive and the synchronous replication used in HBase and keep all the data and metadata in synch on a backup cluster.

Beep beep beep. . . .

For those people with Oracle relational databases, Cloudera Enterprise BDR is analogous to Oracle's Data Guard, which is used to keep backup copies of production databases. Zedlewski says that failover between Hadoop clusters has not been automated, and the recovery time objective for failover, given this and the complexity of replication at these different layers in a Hadoop setup, is 30 minutes to an hour, not minutes or seconds. Cloudera is currently working on snapshotting for HDFS and HBase, and that could close the recovery window.

Cloudera is not providing pricing for either the Navigator or Enterprise BDR features, but Zedlewski says it is a "small incremental charge" that will add tens of per cent to a Cloudera support license charge, not double or triple it.

And finally, with the update to the CDH4 stack, Cloudera Manager 4.5 is being kicked out, and you can do rolling updates of the nodes in the cluster rather than having to take the cluster down for four to eight hours to upgrade the nodes in a typical 100-node setup. You can update the Hadoop software more frequently and apply security and other patches as needed without taking the cluster down.

Now all Cloudera needs to do is coordinate the rolling updates of Hadoop with rolling updates of Linux, Java, and the other elements underneath Hadoop that also need to be patched in a rolling fashion.
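The general idea behind a rolling update can be sketched in a few lines of Python: upgrade a small batch of nodes at a time so the rest of the cluster keeps serving. This is only an illustration of the technique, not how Cloudera Manager 4.5 implements it; the rolling_upgrade function, the upgrade_node hook, the node names, and the batch size of ten are all hypothetical.

```python
def rolling_upgrade(nodes, batch_size, upgrade_node):
    """Upgrade nodes batch by batch so the rest of the cluster stays up.

    Returns the batches in the order they were upgraded.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            # Hypothetical hook: stop services, install software,
            # restart, and health-check a single node.
            upgrade_node(node)
        batches.append(batch)
    return batches


upgraded = []
cluster = [f"node{i:03d}" for i in range(100)]  # a 100-node setup
batches = rolling_upgrade(cluster, batch_size=10, upgrade_node=upgraded.append)
print(len(batches), "batches; at most",
      max(len(b) for b in batches), "nodes offline at once")
```

In this sketch only one batch is out of service at any moment, which is the trade being made: the upgrade touches the cluster in slices instead of taking the whole thing down at once.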

Cloudera may not provide official pricing, but it says that depending on features it costs anywhere from $2,000 to $4,000 per node for a support contract for the CDH4 stack and the Cloudera Manager.
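For a sense of scale, the per-node figures can be multiplied out for the 100-node setup mentioned earlier. The short Python sketch below does that arithmetic; the 20 per cent uplift for the add-on features is a made-up illustrative number (Zedlewski only said "tens of per cent"), and the function names are hypothetical.

```python
def support_cost_range(nodes, low_per_node=2_000, high_per_node=4_000):
    """Return the (low, high) support cost in dollars for a cluster."""
    return nodes * low_per_node, nodes * high_per_node


def with_uplift(cost, uplift_pct):
    """Add a percentage uplift to a whole-dollar cost using integer math."""
    return cost * (100 + uplift_pct) // 100


low, high = support_cost_range(100)
print(f"Base support: ${low:,} to ${high:,}")

# Assumption: a 20 per cent uplift, chosen only to illustrate the arithmetic.
print(f"With add-ons: ${with_uplift(low, 20):,} to ${with_uplift(high, 20):,}")
```

On those numbers the base contract for 100 nodes lands between $200,000 and $400,000, and an uplift in the tens of per cent keeps the total well short of doubling it.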

Biz is booming, and Project Impala is impending

Cloudera is privately held and has raised $141m in five rounds of venture funding from a slew of investors, and must be itching to go public or be acquired for some outrageous multiple.

Zedlewski is not about to comment on any of that, but he did say that Cloudera was "steadily moving out of the startup phase" and now has 320 employees and has more than doubled its bookings and revenues in the past year. The company currently has more than 150 paying customers.

EMC's Pivotal Initiative made a big splash ahead of the Strata conference, launching its Hawq SQL database overlay for HDFS, which is a direct competitor to the Project Impala real-time, parallel query extensions that Cloudera cooked up to speed up Hive.

"Impala is going great guns, and we think we will be able to get it to general availability in a month or two," says Zedlewski. ®
