MongoDB speaks elephantese with Hadoop Connector upgrades

10Gen proves square JSON pegs can be inserted into round HDFS holes

MongoDB steward 10Gen has increased the capabilities of its Hadoop Connector, which lets administrators shuttle data between MongoDB and HDFS and other Hadoop services.

The updates, announced on Tuesday, add support for Mongo's Binary JSON (BSON) backup files to the connector, along with support for Apache Hive and incremental MapReduce jobs.

The Hadoop Connector puts MongoDB data in a Hadoop Distributed File System (HDFS) costume, letting MapReduce jobs fiddle with the datastores. This tech lets organizations manipulate MongoDB data without having to move it through the data center, saving bandwidth.

Combined, these enhancements help 10Gen push MongoDB beyond being a NoSQL datastore and toward a platform in its own right for minor analytics, data storage, and cross-platform querying. The move follows IBM implementing support for MongoDB's JSON-oriented query method inside DB2 and WebSphere.

Apache Hive is a query engine for Hadoop that lets people probe HDFS datasets with a SQL-like query language instead of writing MapReduce jobs. Hive's model does not map perfectly onto MongoDB's, which created some challenges.

"Figuring out a way to express field mappings for fields in Hive to fields in MongoDB in a way that covers the edge cases users may encounter is tricky," 10Gen software engineer Mike O'Brien told The Register via email. "Also, there are data types in MongoDB that do not have analogous counterparts in Hive (for example, ObjectId) so there are some design decisions around how to handle those as well."
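To make the mapping problem concrete, here is a minimal Python sketch of one way a nested MongoDB document could be flattened into a Hive-style row. It is not the connector's code: `COLUMN_MAPPING`, `get_path`, `to_hive_value`, and `to_hive_row` are hypothetical names, and rendering an ObjectId as a hex string is just one possible design decision. The ObjectId is stood in for by its 12 raw bytes so the example needs no MongoDB driver.

```python
from typing import Any, Dict

# Hypothetical mapping: Hive column name -> dotted path into the MongoDB document.
COLUMN_MAPPING: Dict[str, str] = {
    "id": "_id",
    "name": "name",
    "city": "address.city",
}


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_hive_value(value: Any) -> Any:
    """Convert values with no Hive counterpart into something a column can hold.

    Assumption: an ObjectId is represented here by its 12 raw bytes and is
    rendered as a 24-character hex string. Other values pass through unchanged.
    """
    if isinstance(value, (bytes, bytearray)) and len(value) == 12:
        return bytes(value).hex()
    return value


def to_hive_row(doc: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build one flat row (column -> value) from one document."""
    return {col: to_hive_value(get_path(doc, path)) for col, path in mapping.items()}


if __name__ == "__main__":
    document = {
        "_id": bytes(range(12)),  # stand-in for an ObjectId
        "name": "Ada",
        "address": {"city": "London"},
    }
    print(to_hive_row(document, COLUMN_MAPPING))
```

A document that lacks a mapped field simply yields `None` for that column, which is one of the edge cases a real mapping scheme has to decide how to handle.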

The BSON file type is also not native to Hadoop, so work had to be done to get the system to churn through the objects without introducing errors.

"To handle splitting for parallelism, it crawls through a BSON file and calculates byte-offsets in the files to create a list of fixed size chunks which are then processed in parallel," O'Brien writes. "Or, the splits can be pre-built locally with a provided script. When reading the bson off disk, it decodes the bson documents on the fly and passes them into the Mapper as a 'BSONObject' which is the base class used to represent a simple document in the mongo java driver."
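The offset-crawling step O'Brien describes can be illustrated with a short Python sketch. It relies only on the BSON format itself, in which every document starts with a little-endian 32-bit integer giving the document's total length in bytes. The function name `compute_splits` and the `target_size` parameter are assumptions made for illustration, not the connector's actual API, and the real implementation may choose its chunk boundaries differently.

```python
import io
import struct
from typing import BinaryIO, List, Tuple


def compute_splits(f: BinaryIO, target_size: int) -> List[Tuple[int, int]]:
    """Crawl a BSON dump and return (start_offset, length) pairs.

    Each split ends on a document boundary and is closed as soon as it
    reaches at least target_size bytes, so no document is cut in half.
    """
    splits: List[Tuple[int, int]] = []
    start = 0  # byte offset where the current split begins
    pos = 0    # byte offset of the document being examined
    while True:
        f.seek(pos)
        header = f.read(4)
        if len(header) < 4:
            break  # end of file
        (doc_len,) = struct.unpack("<i", header)  # total document length
        if doc_len < 5:  # the smallest valid BSON document is 5 bytes
            raise ValueError(f"corrupt BSON length {doc_len} at offset {pos}")
        pos += doc_len  # jump straight to the next document
        if pos - start >= target_size:
            splits.append((start, pos - start))
            start = pos
    if pos > start:
        splits.append((start, pos - start))  # leftover documents
    return splits


if __name__ == "__main__":
    # Hand-built BSON for {"a": 1}: length, int32 type tag, key, value, terminator.
    doc = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
    print(compute_splits(io.BytesIO(doc * 5), target_size=24))
    # prints [(0, 24), (24, 24), (48, 12)]
```

Because only the length prefixes are read, the crawl never has to decode a full document. Each resulting offset range can then be handed to a separate worker for parallel processing.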

In the future, the company plans to boost performance, improve integration with various Hadoop APIs, and "expose some more fine-grained control options to the user on how jobs run and read/write data," O'Brien said.

As more and more companies invite Hadoop into their data center, gaining compatibility with the technology will be crucial for new databases, lest developers start forsaking the data stores for more HDFS-friendly systems. With the Hadoop connector, 10Gen is working to make sure this problem doesn't appear, and that DBAs can dance with the elephant, wherever their data is stored. ®
