Google's epic graph cruncher mimicked with open source

And then there was GoldenOrb

Unlike Facebook or Yahoo!, Google is loath to open source its back-end software. For many, this is a sore point, as the search giant has built its famously distributed infrastructure atop countless open source tools fashioned outside the walls of the Googleplex. But Mountain View does give back in less-direct ways.

In some cases, Google will publish research papers describing one of the proprietary platforms driving its back end, and to a certain degree this allows outside developers to mimic these platforms with open source projects. Google papers on its GFS distributed file system and MapReduce distributed number-crunching platform, for example, gave rise to the open source Hadoop, and a paper on its BigTable distributed database sparked the open source HBase project.

Now, much the same thing has happened with Google Pregel, Mountain View's platform for processing enormous online graphs, such as a map of the web itself, or of a social network, graphing relationships between people. This week, a Texas startup known as Ravel unveiled an open source project based on Google's 2010 paper describing Pregel. Open sourced under an Apache license at GitHub, the project is dubbed GoldenOrb.

Zach Richardson

A few years back, while working on a PhD in computational mathematics, Ravel president and GoldenOrb lead architect Zach Richardson helped found a small company that helped other businesses process large amounts of data, including "semantic web" data, which seeks to give machines a better "understanding" of text on the internet. He and his colleagues soon realized that solving such problems required better tools.

"We were doing consulting at the intersection of the semantic web and Big Data," Richardson tells The Register. "Semantic web data is inherently stored in a graph, and when you get to very large data sizes, a lot of the traditional methodologies for processing or trying to understand that data no longer work. Or they don't scale. Or they take a completely unrealistic amount of time."

As luck would have it, Google published its Pregel paper. Pregel is a computational model that dovetails with Google's existing data-storage technologies, including the Google File System (GFS) and BigTable. In essence, data from GFS or BigTable is shuttled to Pregel, where the data is crunched. Presumably, Google Chubby – the company's distributed lock service – is used to manage access to data.

In the open source world, GFS, BigTable, and Chubby are mirrored by the Hadoop Distributed File System (HDFS), HBase, and ZooKeeper. Naturally, Richardson and his fellow developers built GoldenOrb atop such open source platforms. ZooKeeper handles data synchronization across distributed machines, and Hadoop's remote procedure call (RPC) layer passes messages from node to node. Google's Pregel paper provides a high-level description of the Pregel programming model, but the rest was guesswork.

"Google provides the higher level concepts of when something needs to synchronize, what communications need to happen between servers, and what your programming model needs to look like to use it," Richardson says. "But how close we are to their implementation? It's very hard to guess."

Building an initial GoldenOrb platform took about seven months of "on and off" work. Richardson and his team have not had even a cursory discussion with Google about the platform.

In addition to developing the platform, Ravel will build applications that run atop it. "We're focused on building enterprise products that analyze data," Richardson says. "Graph problems are [almost infinite]. This includes social network analysis, a very popular topic, but the same algorithms might also be used in things like epidemiology research or pharmaceutical research."

According to Richardson, the platform is suited to situations in which you need random access to data while your algorithm is running. MapReduce, by contrast, is designed for batch processing: you take a large chunk of data, break it up into tiny pieces, and spread it across a cluster of machines for processing. GoldenOrb can run algorithms that grab particular pieces of information from distributed machines on the fly. "With MapReduce, if I'm doing a calculation on one machine and I happen to need information on another machine, there's no way to get it," he says. "GoldenOrb can share information across all machines as necessary to solve the problem."
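
The Pregel paper describes this kind of sharing as vertex-centric message passing: computation proceeds in synchronized "supersteps", and in each one a vertex reads the messages sent to it in the previous superstep, updates its own value, and sends messages to its neighbours, which may live on other machines. The single-process Python sketch below illustrates that model only. It is not GoldenOrb's or Google's API, and the function name, parameters, and toy graph are all hypothetical.

```python
from collections import defaultdict


def run_supersteps(values, edges, max_supersteps=50):
    """Spread the largest vertex value through a graph, Pregel-style.

    values: dict mapping vertex id -> initial numeric value
    edges:  dict mapping vertex id -> list of out-neighbour ids
    Returns (final values, number of supersteps executed).
    All names here are illustrative, not part of any real framework.
    """
    values = dict(values)        # work on a copy of the caller's data
    inbox = defaultdict(list)    # messages readable in the current superstep
    active = set(values)         # every vertex starts out active

    superstep = 0
    while superstep < max_supersteps:
        # A halted vertex wakes up again only if a message arrived for it.
        to_run = active | set(inbox)
        if not to_run:
            break                # no active vertices, no messages: done

        outbox = defaultdict(list)
        for v in to_run:
            incoming = inbox.get(v, [])
            best = max([values[v]] + incoming)
            changed = best > values[v]
            values[v] = best
            # Tell neighbours on the first superstep, or when we learned
            # something new. In a cluster these sends would cross machines.
            if superstep == 0 or changed:
                for neighbour in edges.get(v, []):
                    outbox[neighbour].append(best)

        active = set()           # every vertex votes to halt after computing
        inbox = outbox           # barrier: messages become visible next step
        superstep += 1

    return values, superstep
```

Calling `run_supersteps({"a": 3, "b": 6, "c": 2, "d": 1}, {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]})` leaves every vertex holding 6, because the largest value travels one hop per superstep. A real system would partition the vertices across workers and deliver the outbox over the network between supersteps.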

Ravel employs about fifteen people, and according to Richardson, the company has already started building products on the open source platform. But he declined to discuss them. ®
