Hadoop's little buddy Nutch 2.0 gulps down web's big data

Apache projects united

Hadoop daddy Doug Cutting's Nutch, the open-source web-search engine written in Java, has been updated to crawl through piles of big data on the web.

The Apache Software Foundation (ASF) has released Nutch 2.0, featuring a data abstraction layer that plugs into big-data stores and frameworks including Apache Accumulo, Avro, Cassandra, HBase and, yes, the Hadoop Distributed File System (HDFS).

The abstraction layer is yet another Apache project, Gora – a framework that provides an in-memory data model and persistence layer for big data.

Gora works with NoSQL column stores, key value stores and document stores, as well as with RDBMSes.
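
To give a rough feel for what a storage abstraction layer buys a crawler, here is a minimal Java sketch. It is not Gora's API: the names PageStore, InMemoryPageStore and Crawler are hypothetical and invented for illustration, and the in-memory map merely stands in for a real backend such as HBase or Cassandra.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical interface for illustration only; this is NOT Gora's actual API.
interface PageStore {
    void put(String url, String content);
    Optional<String> get(String url);
}

// One possible backend: a plain in-memory map standing in for a real data store.
class InMemoryPageStore implements PageStore {
    private final Map<String, String> rows = new HashMap<>();

    @Override
    public void put(String url, String content) {
        rows.put(url, content);
    }

    @Override
    public Optional<String> get(String url) {
        return Optional.ofNullable(rows.get(url));
    }
}

// Crawler-side code depends only on the interface, so the backend can be swapped.
class Crawler {
    private final PageStore store;

    Crawler(PageStore store) {
        this.store = store;
    }

    void record(String url, String content) {
        store.put(url, content);
    }
}
```

Because the crawler in this sketch only ever sees the interface, pointing it at a different store would mean supplying another implementation rather than rewriting the crawl logic, which is the general idea behind an abstraction layer of this kind.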

Gora's home on the ASF website states the project's goal as becoming "the standard data representation and persistence framework for big data".

Nutch 2.0 also builds on the Apache open-source search server Solr, adding a crawler and a link-graph database, with parsing support handled by the Apache Tika project.

Cutting wrote Nutch in 2003 with Mike Cafarella, while the pair were also developing Hadoop – using Google's MapReduce distributed data processing framework to make the Hadoop system work at scale. Cutting also wrote Lucene, but it was Hadoop that made his name and he was brought in by Yahoo! to implement the system on its servers.

Nutch has since been somewhat eclipsed by Hadoop, which is used by Amazon.com, Facebook and Yahoo! to name just three web giants. Search engines written using Nutch include Krugle and mozDex. ®
