The Register® — Biting the hand that feeds IT

Storage startup busts object location barrier

Scality's Ring cycle: One Ring to find them all

File system trees are inefficient and slow at locating files in a filespace occupied by billions of files and folders. Storing the data as objects in a flat storage space is becoming a recommended alternative. But as soon as you go for object storage to defeat this file system tree-traversal problem, you face a fresh problem: how do you locate your objects?

Either you have a central object map or database or you have a distributed one. French startup Scality has gone for the distributed approach with its Ring technology.

The idea is to have virtually unlimited scalability, both of I/O and storage capacity, by using clustered commodity x86 servers organised as peer-to-peer nodes – conceptually occupying a ring – with front-end Accessor software nodes receiving requests from users and applications on servers.

Scality CEO Jerome Lecat says an Accessor node can access any Ring node, note the "any", and find the right node storing a requested object in one network hop with 10 nodes, two hops with 100 nodes, and three hops with 1,000 nodes.

Scality Ring technology

Holy Trinity in Scality's Ring technology: Accessors to the left, the Ring in the middle and secondary storage to the right.

A variety of Accessor node technologies are supported: native REST HTTP, NFS, BRS2 and Zimbra.

With each 10X increase in the Ring node count, the hop count goes up by one because of Scality's patented technology and its algorithm. We might call this a quite peculiar Ring cycle.
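The quoted figures (one hop at 10 nodes, two at 100, three at 1,000) amount to rounding up the base-10 logarithm of the node count. As a rough sketch only, and not Scality's patented algorithm, the rule can be written with integer arithmetic. The function name `hops_needed` and the `fanout` parameter are made up for this illustration.

```python
def hops_needed(node_count: int, fanout: int = 10) -> int:
    """Smallest h such that fanout**h >= node_count.

    Illustrative only: it mirrors the 10X-per-hop figures quoted in the
    article, not Scality's actual routing algorithm.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    hops, reach = 0, 1
    while reach < node_count:
        reach *= fanout  # each extra hop multiplies the reachable nodes
        hops += 1
    return hops


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        print(n, "nodes ->", hops_needed(n), "hops")
```

By the same rule a 10,000-node ring would need four hops, although the article gives no measured figure for a ring of that size.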

Lecat said: "There are really two 'tricks' here. [First] an algorithm delivering a maximum of Log(n) complexity – which basically gives one a 100-node network. Each node needs to know seven nodes, and a request may take seven hops. The minimum requirement from a mathematical standpoint is for each node to know a few other nodes. The number of nodes increases as Log(number of nodes), which means that when the number of nodes is x10, you need to add 1 to the number of nodes to be known, or number of hops.

"[Secondly] in practice, we allow nodes to know many more nodes, but this acts as a 'non authoritative cache', and it allows for a request to 'usually' converge in two hops, while keeping all the mathematical properties of the model (Log complexity, limited number of hops, good behaviour when a node is lost or added)."

Each node can handle 10 to 50TB of storage, with 1,000 nodes supporting up to 50PB of capacity, and accessing the right object in that 50PB with three hops on a gigabit LAN takes 20ms or less.

Distributed hash table

How does that work? Scality documentation says that the Ring nodes are organised into segments. Objects are stored with a Distributed Hash Table (DHT) algorithm, which produces a value for the object and its associated key. Key and value pairs are stored in the DHT and nodes retrieve the value associated with a particular key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes. Keys embed information about class of service, and each node is autonomous and responsible for consistency checking and rebuilding replicas automatically for its keys.

We can think in terms of Scality's Ring nodes crossing a key space. This is organised into a hierarchy such that a 10-node ring requires one node-to-node hop to find the target node, a 100-node ring needs two hops and a 1,000-node monster needs three hops.

Lecat says: "The key space is distributed among all the nodes. The key space is very large (20 bytes), and distributed nearly evenly, but never exactly evenly. The underlying algorithm is a distributed hash table. The 'segments' do not have a constant size (as everything has to be dynamic in the system to allow real elasticity).

"Two key properties of the key space are that keys have an order, and they are organised into a circle (which gives trigonometric properties)."
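To make the "ordered keys on a circle" idea concrete, here is a minimal sketch of a 20-byte (160-bit) key space. SHA-1 is used purely because it happens to produce 20 bytes; Scality does not say which hash it uses, and the names `key_for` and `clockwise_distance` are invented for this example.

```python
import hashlib

KEY_BITS = 160            # 20 bytes, the key size Lecat quotes
RING_SIZE = 1 << KEY_BITS


def key_for(name: str) -> int:
    """Map a name to a point in the key space.

    Assumption: SHA-1, chosen only because its digest is 20 bytes long.
    """
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def clockwise_distance(a: int, b: int) -> int:
    """Distance travelled going clockwise from key a to key b on the circle."""
    return (b - a) % RING_SIZE


if __name__ == "__main__":
    k = key_for("photo-0001.jpg")
    print(hex(k))
    # The largest key sits right next to key 0: the space wraps around.
    print(clockwise_distance(RING_SIZE - 1, 0))
```

The modulo in `clockwise_distance` is what turns an ordered line of keys into a circle.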

Let's take a 10-node Ring as an example. An Accessor sends an object retrieval request to node 1, which doesn't have it. We're told the object can be retrieved with one hop, a jump from node 1 to the right node. Node 1 has enough information to send the request on to the right node, the one that holds the object, and so does every other node in the 10-node ring: that's how a distributed hash table works.

Scality doesn't say in detail how this works. I think it is a variation on this concept: each node has an ID and the nodes are organised in a ring, a doubly linked list in which each node holds a reference to, and the address of, both the previous node and the next node on the ring. Going round the ring, nodes have successively greater IDs until you return to the starting node.

Okay? Keep that in mind and let's move on to the request-receiving node, which takes the key from the Accessor request and hashes it to generate a value with exactly the same number of bits as a node ID. The system treats this hash as a node ID and goes round the ring node by node, looking for the node ID that is as close as possible to the key hash while still being larger. That node should store the desired object.

A reversal of this is used to store incoming objects on the Ring and ensure they are locatable.
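The store and lookup steps described above can be sketched as a generic consistent-hashing ring. This is a guess at the general concept, not Scality's implementation: the `Ring` class, its method names and the use of SHA-1 are all assumptions. The sketch also cheats by keeping a sorted list of every node ID in one place instead of walking node by node, and it treats a node whose ID exactly equals the key hash as the owner.

```python
import bisect
import hashlib


def hash_id(text: str) -> int:
    # Assumption: SHA-1, giving a 160-bit value comparable with node IDs.
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


class Ring:
    """Toy ring: each key belongs to the first node ID at or after its hash."""

    def __init__(self, node_ids):
        self.node_ids = sorted(node_ids)
        self.stores = {nid: {} for nid in self.node_ids}  # per-node storage

    def owner_of(self, key_hash: int) -> int:
        i = bisect.bisect_left(self.node_ids, key_hash)
        # Past the highest ID we wrap round to the lowest one.
        return self.node_ids[i % len(self.node_ids)]

    def put(self, object_key: str, data: bytes) -> int:
        owner = self.owner_of(hash_id(object_key))
        self.stores[owner][object_key] = data
        return owner

    def get(self, object_key: str) -> bytes:
        owner = self.owner_of(hash_id(object_key))
        return self.stores[owner][object_key]


if __name__ == "__main__":
    ring = Ring(hash_id(f"node-{i}") for i in range(10))
    ring.put("photo-0001.jpg", b"...image bytes...")
    print(ring.get("photo-0001.jpg"))
```

Because storing and fetching both run the same hash-then-find-successor step, any node that receives a request can work out where the object lives without consulting a central map.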

Lecat said: "If a node is lost, the ring rebalances itself without human intervention. [It's the] same if a ring node is added (human intervention needed to decide to add a node), the new node is automatically placed well in the key space, and rebalances only occur when necessary and automatically."
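A textbook consistent-hashing ring shows why such rebalancing can stay local. In the sketch below, adding or removing a node only changes the owner of keys in the one affected segment. The node IDs and keys are small made-up integers, and `owner_of` and `moved_keys` are hypothetical helpers; Scality's real placement logic is not public in this article.

```python
import bisect


def owner_of(node_ids, key_hash):
    """First node ID at or after key_hash, wrapping round the ring."""
    ids = sorted(node_ids)
    i = bisect.bisect_left(ids, key_hash)
    return ids[i % len(ids)]


def moved_keys(old_nodes, new_nodes, key_hashes):
    """Keys whose owning node differs between the two ring layouts."""
    return [k for k in key_hashes
            if owner_of(old_nodes, k) != owner_of(new_nodes, k)]


if __name__ == "__main__":
    nodes = [100, 200, 300, 400]   # made-up node IDs
    keys = list(range(0, 400, 10)) # made-up key hashes

    # Adding node 250 takes over only the keys between 200 and 250.
    print(moved_keys(nodes, nodes + [250], keys))

    # Losing node 300 hands only its segment to the next node, 400.
    print(moved_keys(nodes, [100, 200, 400], keys))
```

In both cases the keys owned by the untouched nodes stay where they are, which fits Lecat's claim that rebalancing happens only when necessary.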

To understand any more than this requires a computer science skill set and access to the Scality Ring designers.

Latest Comments

MPAA to send take down notice to all these Distributed hash table P2p for profit any day now ;)

MPAA and others, perhaps Antipiratbyrån (Anti-Piracy Bureau, APB) in the EU OC, to send takedown notices to all these distributed hash table P2P for-commercial-profit companies any day now ;)

After all, distributed hash tables and P2P are illegal the world over, or so they would have you believe, so distributed hash table P2P coming to the cloud can only mean piracy, right? Or that will be how they spin it if it suits their everything's-piracy-because-we-say-so agenda.

Anonymous Coward

Not exactly new

This same sort of DHT or consistent-hashing approach is already widely used. Amazon's Dynamo and Berkeley's OceanStore both work essentially this way for different kinds of data, and have been around for over ten years. Atmos, which you even mention, inherits some of the same ideas from OceanStore. Tahoe-LAFS and GlusterFS both embody those ideas for files, Voldemort/Cassandra/Riak for key/value stores, etc. Scality might indeed have some very good technology, but the DHT part isn't their differentiating feature.

Testing 10,000 nodes

They could run 10 VM instances per physical node. Of course that'll introduce inefficiency, but it should be just a constant VM overhead, and will let them see if anything breaks down or scales badly.
