MongoDB speaks elephantese with Hadoop Connector upgrades

10Gen proves square JSON pegs can be inserted into round HDFS holes

MongoDB steward 10Gen has increased the capabilities of its Hadoop Connector, which lets administrators shuttle data between MongoDB and HDFS and other Hadoop services.

The updates, announced on Tuesday, add support for MongoDB's Binary JSON (BSON) backup files to the connector, along with support for Apache Hive and incremental MapReduce jobs.

The Hadoop Connector puts MongoDB data in a Hadoop Distributed File System (HDFS) costume, letting MapReduce jobs fiddle with the datastores. This tech lets organizations manipulate MongoDB data without having to move it through the data center, saving bandwidth.

Combined, these enhancements help 10Gen push MongoDB beyond being a NoSQL datastore and toward a platform in its own right for minor analytics, data storage, and cross-platform querying. The move follows IBM implementing support for MongoDB's JSON-oriented query method inside DB2 and WebSphere.

Apache Hive is a query engine for Hadoop that lets people probe HDFS datasets with a SQL-like query language instead of writing MapReduce jobs. Hive's tables and columns do not map perfectly onto MongoDB's documents, which created some challenges.

"Figuring out a way to express field mappings for fields in Hive to fields in MongoDB in a way that covers the edge cases users may encounter is tricky," 10Gen software engineer Mike O'Brien told The Register via email. "Also, there are data types in MongoDB that do not have analogous counterparts in Hive (for example, ObjectId) so there are some design decisions around how to handle those as well."
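To make the mapping problem concrete, here is a minimal Python sketch of the kind of decision O'Brien describes. It is not the connector's code: the `COLUMN_MAPPING` table, the helper names, and the choice to render an ObjectId as a hex string are all assumptions made for illustration, and the `ObjectId` class is a stand-in for the driver's 12-byte type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """Stand-in for the driver's 12-byte ObjectId type (illustration only)."""
    raw: bytes


# Hypothetical mapping: flat Hive column name -> dotted MongoDB field path.
COLUMN_MAPPING = {
    "id": "_id",
    "name": "name",
    "city": "address.city",
}


def lookup(document, path):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    value = document
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def to_hive_value(value):
    """Coerce types with no Hive counterpart; here an ObjectId becomes a hex string."""
    if isinstance(value, ObjectId):
        return value.raw.hex()
    return value


def to_hive_row(document, mapping=COLUMN_MAPPING):
    """Flatten one MongoDB-style document into a dict of Hive column values."""
    return {col: to_hive_value(lookup(document, path)) for col, path in mapping.items()}
```

Missing fields come back as `None` in this sketch, which is one possible answer to the edge cases mentioned in the quote; a real mapping layer might reject such documents or fill in defaults instead.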

The BSON file format is also not native to Hadoop, so work had to be done to get the system to churn through the objects without introducing errors.

"To handle splitting for parallelism, it crawls through a BSON file and calculates byte-offsets in the files to create a list of fixed size chunks which are then processed in parallel," O'Brien wrote. "Or, the splits can be pre-built locally with a provided script. When reading the bson off disk, it decodes the bson documents on the fly and passes them into the Mapper as a 'BSONObject' which is the base class used to represent a simple document in the mongo java driver."
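The byte-offset crawl O'Brien describes can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the connector's implementation: the function name `bson_splits` and the `target_size` parameter are hypothetical, the input is assumed to be a well-formed file of back-to-back BSON documents, and chunks here end on document boundaries, so they are roughly rather than exactly fixed in size. The one detail taken from the BSON format itself is that every document begins with its own total length as a little-endian 32-bit integer.

```python
import io
import struct


def bson_splits(stream, target_size):
    """Return (offset, length) chunks that each hold whole BSON documents."""
    splits = []
    start = 0  # byte offset where the current chunk begins
    pos = 0    # byte offset of the next unread document
    while True:
        header = stream.read(4)
        if len(header) < 4:
            break  # end of file
        # The length prefix counts the whole document, including these 4 bytes.
        (doc_len,) = struct.unpack("<i", header)
        if doc_len < 5:
            raise ValueError(f"invalid BSON document length {doc_len} at offset {pos}")
        # Skip the document body without decoding it.
        stream.seek(doc_len - 4, io.SEEK_CUR)
        pos += doc_len
        # Close the chunk once it has reached the target size.
        if pos - start >= target_size:
            splits.append((start, pos - start))
            start = pos
    if pos > start:
        splits.append((start, pos - start))  # trailing partial chunk
    return splits
```

Each `(offset, length)` pair could then be handed to a separate worker, which seeks to the offset and decodes only the documents in its own chunk.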

In the future, the company plans to boost performance, improve integration with various Hadoop APIs, and "expose some more fine-grained control options to the user on how jobs run and read/write data," O'Brien said.

As more and more companies invite Hadoop into their data centers, gaining compatibility with the technology will be crucial for new databases, lest developers start forsaking those data stores for more HDFS-friendly systems. With the Hadoop Connector, 10Gen is working to make sure this problem doesn't appear, and that DBAs can dance with the elephant, wherever their data is stored. ®
