
Hadoop's little buddy Nutch 2.0 gulps down web's big data

Apache projects united


Hadoop daddy Doug Cutting's Nutch, the open-source web-search engine written in Java, has been updated to crawl through piles of big data on the web.

The Apache Software Foundation (ASF) has released Nutch 2.0, featuring a data abstraction layer that plugs into the big-data stores and frameworks Apache Accumulo, Avro, Cassandra, HBase and, yes, the Hadoop Distributed File System (HDFS).

That abstraction layer is yet another Apache project, Gora, a framework that provides an in-memory data model and persistence layer for big data.

Gora works with NoSQL column stores, key value stores and document stores, as well as with RDBMSes.
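To make the idea of a storage abstraction concrete, here is a minimal sketch in Python. It is not Gora's API (Gora itself is a Java framework); the names `WebPage`, `DataStore`, `MemoryStore` and `record_fetch` are all hypothetical, chosen only to show how crawler code might depend on one interface while the backend behind it is swapped.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class WebPage:
    """Hypothetical in-memory record for one crawled page."""
    url: str
    content: str


class DataStore(ABC):
    """Hypothetical backend-neutral persistence interface (not Gora's API)."""

    @abstractmethod
    def put(self, key: str, page: WebPage) -> None:
        """Persist a page under the given key."""

    @abstractmethod
    def get(self, key: str) -> Optional[WebPage]:
        """Return the page stored under key, or None if absent."""


class MemoryStore(DataStore):
    """Stand-in backend; a real one would wrap a column store or an RDBMS."""

    def __init__(self) -> None:
        self._rows: Dict[str, WebPage] = {}

    def put(self, key: str, page: WebPage) -> None:
        self._rows[key] = page

    def get(self, key: str) -> Optional[WebPage]:
        return self._rows.get(key)


def record_fetch(store: DataStore, url: str, content: str) -> None:
    """Crawler-side code sees only the DataStore interface."""
    store.put(url, WebPage(url=url, content=content))
```

Under these assumptions, pointing the crawler at a different store would mean writing another `DataStore` subclass, with `record_fetch` left untouched.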

Gora's home on the ASF website states the project's goal as becoming "the standard data representation and persistence framework for big data".

Nutch 2.0 also builds on the Apache open-source search server Solr, adding a crawler and a link-graph database, with parsing support handled by the Apache Tika project.

Cutting wrote Nutch in 2003 with Mike Cafarella, while the pair were also developing Hadoop, using Google's MapReduce distributed data processing framework to make the Hadoop system work at scale. Cutting also wrote Lucene, but it was Hadoop that made his name, and he was brought in by Yahoo! to implement the system on its servers.

Nutch has since been somewhat eclipsed by Hadoop, which is used by Amazon.com, Facebook and Yahoo! to name just three web giants. Search engines written using Nutch include Krugle and mozDex. ®
