Original URL: http://www.theregister.co.uk/2009/10/09/xiv_and_the_cloud/

IBM multi-petabyte cloud defies XIV storage

How does it scale that much?

By Chris Mellor

Posted in Storage, 9th October 2009 07:02 GMT

Comment With its Smart Business Storage Cloud, IBM says its GPFS and XIV are being used to build a system capable of multiple petabytes of capacity, supporting billions of files with high-performance computing-like I/O performance. But a single XIV array has a maximum usable capacity of just 79TB.

We can envisage an IBM BladeCenter server set running GPFS (the General Parallel File System) connected to an XIV rack, but this seems utterly unbalanced: on the one hand, a BladeCenter/GPFS setup capable of handling multiple petabytes and billions of files; on the other, an XIV array that can hold only 79TB and, at best, a few million files of any size. What is going on? How will this work?

To get XIV storage up to the multi-PB level IBM has three options: scale up XIV, scale out XIV, or do it indirectly by scaling out server nodes with attached XIV arrays. Which is it going to use?

Scaling up XIV

XIV arrays use 1TB drives. Swap in 2TB drives and the usable capacity doubles to 158TB, which is still nowhere near enough. If we take a multi-petabyte capability to mean 10PB, we would need roughly 64 times the capacity of that 158TB array, rising to around 127 times as we continue scaling towards 20PB.

It seems extremely unlikely that IBM is going to announce a souped-up XIV product with a 64X to 127X increase in capacity. The backplane and controller-drive enclosure interconnect technology would be awesomely difficult to design, engineer and develop.
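The back-of-the-envelope arithmetic runs like this (a sketch using the article's figures, with 1PB taken as 1,000TB):

```python
# Back-of-the-envelope XIV scaling arithmetic using the article's figures.
# These are illustrative numbers, not an IBM roadmap.
XIV_USABLE_TB_1TB_DRIVES = 79          # current maximum usable capacity
XIV_USABLE_TB_2TB_DRIVES = 2 * 79      # same drive count with 2TB drives: 158TB

def scale_factor(target_pb: float, array_tb: float) -> float:
    """How many times today's array capacity is needed to hit the target."""
    return (target_pb * 1000) / array_tb  # 1PB taken as 1,000TB

ten_pb = scale_factor(10, XIV_USABLE_TB_2TB_DRIVES)     # ~63.3, call it 64X
twenty_pb = scale_factor(20, XIV_USABLE_TB_2TB_DRIVES)  # ~126.6, around 127X
print(round(ten_pb), round(twenty_pb))
```

Even with the generous assumption of a drive-capacity doubling, the shortfall is nearly two orders of magnitude.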

Scaling out XIV

How about clustering XIV arrays? Can we do that? The XIV, with its 21 nodes, is already internally clustered, IBM describing it thus: "The XIV system is based on a grid of standard, off-the-shelf hardware components connected in any-to-any topology by means of massively paralleled, non-blocking Gigabit Ethernet." Devising a cluster interconnect and node software to keep things on track for an already clustered product is like engineering a cluster of clusters.

To scale to, say, the 10PB level, a cluster of XIV arrays, each using 2TB drives and maxing out at 158TB, would need 64 of them as nodes. A 64-node interconnect capable of scaling to higher node counts is perfectly feasible.

Isilon and Exanet make NAS clusters that go past the 1PB capacity level. Isilon can have up to 96 nodes in a scale-out design using 20Gbit/s InfiniBand. So we could envisage some way of scaling out XIV capacity by linking XIV nodes together with InfiniBand and adding functionality to XIV software to enable the nodes to co-operate.

However, scaling out XIV in this way would need a lot of XIV development work. Suppose we take a different tack.

Indirectly scaling out XIV

The XIV is a block storage device whereas GPFS is, obviously, a file system. Could we be looking at BladeCenter servers running GPFS and acting as network-attached storage (NAS) heads, each with its own XIV array? Scaling the system would mean adding more BladeCenter/GPFS servers plus an XIV array with all the inter-nodal functionality carried out at the server level. IBM says (pdf): "In addition to nodes that are directly attached to the storage, a single GPFS file system can be accessed by thousands of nodes using a LAN connection like Ethernet or InfiniBand."
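The appeal of this indirect approach is that capacity grows linearly with each BladeCenter/XIV pair added, which can be sketched as a simple model (class and method names are illustrative, not GPFS APIs):

```python
# Sketch of a scale-out file system built from GPFS server nodes, each with
# its own attached XIV array. Hypothetical model, not real GPFS internals.
class ScaleOutFS:
    def __init__(self, num_server_nodes: int, array_usable_tb: int = 79):
        self.nodes = num_server_nodes    # BladeCenter/GPFS NAS heads
        self.array_tb = array_usable_tb  # usable capacity of each XIV array

    def total_capacity_tb(self) -> int:
        # Capacity grows linearly: add a server + XIV pair per node.
        return self.nodes * self.array_tb

    def node_for_block(self, block_no: int) -> int:
        # Round-robin striping spreads block I/O across all server nodes.
        return block_no % self.nodes

fs = ScaleOutFS(num_server_nodes=127)  # 127 pairs of today's 79TB arrays
print(fs.total_capacity_tb())          # 10,033TB, roughly 10PB
```

All the inter-nodal intelligence lives at the GPFS server level; the XIV arrays themselves need no cluster awareness.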

That sounds promising. It has this to say about GPFS and block I/O: "GPFS has a very flexible cluster architecture providing many options to develop a solution including: direct attached, network block I/O, a combination of the two and multi-site operations... the network block I/O (also called network shared disk or NSD)... is a software layer that forwards block I/O requests from an NSD client application node to the LAN, which then passes the request to an NSD storage node to perform the disk I/O and pass data back to the client. GPFS makes the LAN-based I/O operation transparent to the application. Using a Network Block I/O configuration can be more cost-effective than a full-access SAN and can be used to tie together systems across a WAN."

XIV is an NSD storage node in this scenario.
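The NSD forwarding path IBM describes can be reduced to a minimal sketch: an application node with no direct SAN access sends a block request over the LAN to a storage node, which performs the disk I/O and returns the data (names here are illustrative, not real GPFS APIs):

```python
# Minimal sketch of GPFS-style Network Shared Disk (NSD) block forwarding,
# as described in the quoted IBM text. Illustrative classes, not GPFS code.
class NSDStorageNode:
    """Server directly attached to block storage (an XIV array here)."""
    def __init__(self, blocks: dict):
        self.blocks = blocks  # block number -> block contents

    def read_block(self, block_no: int) -> bytes:
        # In real GPFS this would be a disk I/O against the attached array.
        return self.blocks[block_no]

class NSDClient:
    """Application node with no SAN access; block I/O goes over the LAN."""
    def __init__(self, storage_node: NSDStorageNode):
        self.storage_node = storage_node  # stands in for the LAN hop

    def read(self, block_no: int) -> bytes:
        # Forwarding is transparent: the application just sees its data.
        return self.storage_node.read_block(block_no)

server = NSDStorageNode({0: b"superblock", 1: b"file data"})
client = NSDClient(server)
print(client.read(1))  # client never touches the disk directly
```

This is what makes the indirect route attractive: the application nodes need only a LAN connection, not a full-access SAN.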

Which XIV scaling route is IBM going to use: scaling up XIV nodes; scaling out XIV nodes; or indirect scaling out with nodes composed of BladeCenter/GPFS servers each with its own XIV array?

From here it looks as if existing XIV arrays linked to BladeCenter server/GPFS nodes will be tied together in a scale-out architecture with the GPFS/BladeCenter servers performing the interconnect functions. That seems far more likely, and easier to accomplish than scaling XIV up internally or developing XIV super clusters. ®