The God Box: Searching for the holy grail array

Original URL: https://www.theregister.com/2012/02/13/the_god_box/

Latency killing super spinner

Posted in Storage, 13th February 2012 08:02 GMT

It's so near we can almost smell it: the Holy Grail storage array combining server data location, solid state hardware speed, memory access speed virtualisation, and the capacity, sharability and protection capabilities of networked arrays. It's NAND, DAS, SAN and NAS combined; the God storage box – conceivable but not yet built.

We can put the Lego-like blocks together in our minds. The God box building blocks are virtualised servers; PCIe flash; flash-enhanced and capacity-centric SAN and NAS arrays and their controller software; atomic writes; flash memory arrays; and data placement software. The key missing pieces are high-speed (PCIe-class) server-array interconnects and atomic writes - direct memory-to-NAND I/O.

The evil every storage hardware vendor is fighting is latency. Applications want to read and write data instantly. The next CPU cycle is here and the app wants to use it and not wait for I/O. Servers are becoming super-charged CPU cycle factories, and data access I/O latency is like sets of traffic lights on an inter-state highway: they just should not be there.

Killing latency

I/O latency comes from three places, broadly speaking: disk seek times, network transit time, and operating system (O/S) I/O subsystem overhead. The disk seek time problem has been cracked; we are moving to NAND flash instead of spinning disk for primary data: the hot, active data. Disk obviously remains the most effective medium for large-scale data storage, particularly if the data is deduplicated. Flash cannot touch it.

There have been four ways of doing this:

We are seeing SSDs slotted into hard disk drive (HDD) slots, with data placement software, like FAST VP, automatically moving data between HDD and SSD as its 'access temperature' rises and falls.
We are also seeing flash used as an array controller cache, with NetApp's FlashCache and EMC's FAST CACHE.
We are seeing newly architected flash and HDD arrays which do a better job, they say, of using flash storage and HDD capacity together. Think NexGen Storage, Nimble Storage and Tintri.
We are seeing all-flash arrays which abandon disks altogether and rely on deduplication, MLC flash and flash-focused, not HDD-focused, controller software in order to bring per-GB cost close to that of disk drive arrays. Think Nimbus, WhipTail, Violin Memory, and startups like Pure Storage, XtremIO and SolidFire.
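
The first of these approaches, moving data between HDD and SSD as its access temperature rises and falls, can be sketched roughly as below. This is a toy illustration only, not how FAST VP or any vendor's placement software actually works; the class name, the SSD slot count and the promotion threshold are all assumptions made up for the example.

```python
from collections import Counter


class TieringSketch:
    """Toy hot/cold placement: busy blocks go to SSD, idle ones fall back to HDD."""

    def __init__(self, ssd_slots, promote_at=3):
        self.ssd_slots = ssd_slots    # assumed: number of blocks the SSD tier can hold
        self.promote_at = promote_at  # assumed: accesses per interval that count as "hot"
        self.heat = Counter()         # access count per block in the current interval
        self.on_ssd = set()           # blocks currently placed on the SSD tier

    def access(self, block):
        """Record an access and report which tier served it."""
        self.heat[block] += 1
        return "ssd" if block in self.on_ssd else "hdd"

    def rebalance(self):
        """At the end of an interval, keep only the hottest blocks on SSD."""
        hot = [b for b, n in self.heat.most_common() if n >= self.promote_at]
        self.on_ssd = set(hot[: self.ssd_slots])
        self.heat.clear()             # start measuring temperature afresh


tiers = TieringSketch(ssd_slots=1)
for _ in range(3):
    tiers.access("db-index")          # hot block
tiers.access("old-archive")           # cold block
tiers.rebalance()
print(tiers.access("db-index"), tiers.access("old-archive"))
```

In this sketch a block that stops being accessed drops back to HDD at the next rebalance, which is the "falls" half of the access temperature idea.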

The big "but" with these four approaches is that network latency still exists – as does the I/O latency from the O/S running the apps. These four approaches only go part of the way on the journey to the God Box.

Storage and servers – come together

Network latency is vanquished by putting the storage in the server or the server in the storage. Putting HDD storage in the server, the direct-attach storage (DAS) route, gets rid of network latency but disk latency is still present. We'll reject that. Disks are just ... so yesterday, and it has to be solid state storage.

There are two approaches to server flash right now: use the flash as a cache or as storage. PCIe flash caches are two a penny: think EMC VFCache (the latest), Micron, OCZ, TMS, Virident and others. You need software to link the cache to the app and you need a networked array to feed the cache with data. This is only a halfway house again because cache misses are expensive in latency terms.

If it's a read cache then it's a "quarterway" house, as writes are not cached. If it doesn't work with server clusters, high-availability, vMotion and/or server failover then it's an "eighthway" house. Most of these issues can be fixed but there is no way a cache can guarantee cache misses won't happen; it's the nature of caching. No matter that caches connected to back-end arrays can offer enterprise-class data protection; the name of the game is latency-killing and caching doesn't permanently slay the many-headed latency hydra. So the flash has to be storage.
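
To see why misses matter so much, here is a rough back-of-the-envelope sketch of average access time for a cache sitting in front of a slower networked array. The formula is the standard weighted average of hit and miss latency; the 50 microsecond hit and 5 millisecond miss figures are illustrative assumptions, not measurements of any product named here.

```python
def average_latency_us(hit_rate, hit_us, miss_us):
    """Expected access time: hits served by the cache, misses by the back-end array."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_us + (1.0 - hit_rate) * miss_us


HIT_US = 50.0     # assumed latency of a PCIe flash cache hit, in microseconds
MISS_US = 5000.0  # assumed latency of a miss that goes out to the networked array

for hit_rate in (0.90, 0.99, 0.999):
    avg = average_latency_us(hit_rate, HIT_US, MISS_US)
    print(f"hit rate {hit_rate:.1%}: average {avg:.1f} us")
```

Under these assumed figures, even a 99 per cent hit rate leaves the average at roughly double the hit latency, because the rare misses cost 100 times as much as a hit.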

Fusion-io is the leading exponent of putting flash as storage into servers. What about putting servers in storage? DataDirect says it does that already with filesystem applications hosted in its arrays. Okay, we'll grant the principle but not the actuality as no-one is running serious business applications in DDN arrays yet.

EMC is saying that virtualised server apps will be vMotioned to server engines in its VMAX, VNX and Isilon arrays. Okay. This means an exit of network latency and, if the arrays are flash-based with flash-aware controllers and not bodged disk-controller SW, then an exit of drive array latency.

EMC is serious and vocal about this approach so we must pay it heed. And we must note that the flash storage tier can be backed up with massive HDD array capacity and protection features. This is a very attractive potential mix of features, although only for servers in the array - I'm hinting at server supply lock-in here - and only if it becomes mainstream, and if it can get rid of the server O/S I/O subsystem latency.

Fusion's new stake

That is where Fusion-io has a new stake in the game, a bet at the storage-server casino that might surprise us.

It has its Auto-Commit Memory (ACM) scheme, which bypasses the (disk-based) I/O subsystem in the O/S hosting a server and does direct memory-to-NAND reads and writes. Fusion reckons its effective ioDrive flash card performance could be boosted 16 times or more in IOPS terms with this approach. Wikibon uses the term Atomic Writes for the concept.

That means Fusion-io is the only vendor in our area to get rid of disk latency, network latency and O/S latency – but it does so at the cost of not being shareable and not having enterprise-class data protection features. It's stymied, right? Not necessarily, and what it is doing can also be done by competitors.

Aprius was a server PCIe bus extending and virtualising startup with technology to let servers share peripheral devices at PCIe bus speeds over a virtual PCIe bus network. It failed last year. Fusion-io marketing VP Rick White said:

Fusion-io acquired certain Aprius IP assets, including three US patents and 20 patent applications last summer. There certainly were a lot of smart people at Aprius, and a number of former Aprius employees are now working at Fusion-io. Please note that these folks did not join Fusion-io as part of our patent acquisition. They joined us as any employee would after going through our standard hiring process.

Now why would Fusion-io do that? Rick White wouldn't say: "I wish we could share more info on all aspects of our business and our plans. Unfortunately sometimes we’re not able to comment as freely as we’d like to but I hope you can understand."

Virtual ioDrives

Fusion-io sees EMC building Project Thunder, a high-speed networked flash array box full of Lightning flash cards, with InfiniBand-class links to servers and, no doubt, treated as a FAST VP tier by back-end VMAX or VNX arrays. Oops, Fusion would be outflanked here by an EMC combo of DAS flash speed, flash as server storage, and enterprise drive array protection levels – plus the obvious risk of VMware getting direct memory-to-NAND I/O capability like that of Fusion's ACM.

That's it, game over – with EMC killing O/S latency, network latency and HDD latency in one storage box. Only Fusion is not going to let this happen. What it may do, or so the El Reg storage desk believes, is add Aprius PCIe virtualisation technology to its ioDrive technology and build a shareable ioDrive array – something that partner NexGen does at the moment.

We would see servers having virtual ioDrives, storage memory areas, mapped to partitioned-off space in a shared, PCIe bus-networked Fusion-io array, and still being capable of direct memory-to-NAND reads and writes.

There needs to be another piece, or possibly even two pieces, added here. One is enterprise-class data protection with snapshots, clones and replication. The other is a capacity play. Fusion-io could add its own backend disk storage or do a deal with a cloud storage provider – like Nirvanix or Joyent (both enterprise-class), or Amazon and Google – and have the cloud become the back-end capacity storage vault and protection destination. That would significantly reduce its R&D burden.

IOV

Server I/O virtualisation (IOV) is a tough game but looks to be an essential piece of our God box-building exercise. Micron has picked up Virtensys, which failed to popularise its technology. Aprius crashed and, by the way, a co-founder, Peter Kirkpatrick, now works as a principal engineer for Violin Memory, so it too could be thinking of taking its existing direct server PCIe connect capability and turning it into a shared server PCIe connection capability, and so getting rid of network latency.

EMC's pre-selling of Project Thunder has raised the stakes in the high-speed server-to-server-and-storage interconnect game enormously. It and Fusion-io have the most skin in the game. The other flash array, flash-HDD combo array, PCIe flash cache, and flash-enhanced drive array vendors have to decide whether the end game we have described here – our mythical God Box – is a coming storage reality or pie in the sky.

If it is unreal, then carry on doing what you are doing.

If it isn't ... you are toast - unless you respond. ®