
Want to super scale-out? You'll be hungry for flash

Nothing succeeds like flash-excess

Scale-out flash arrays sound excessive but they are really not. After all, we can understand scale-out filers, adding node after node to store rapidly growing file populations.

Use cheap and deep disk for the data, with flash stashes holding the metadata so files can be located fast. When the files are large, sequential access to and from disk is pretty fast as well.

But won't scale-out flash filers be monstrously expensive? Overkill, surely? Let’s have a look.

Scale-out is less expensive than scale-up. Instead of a single multi-controller head and a complex backbone network fabric, as in monolithic DS8000/VSP/VMAX-style arrays, a scale-out design generally employs multiple independent nodes organised into a cluster that operates as a single system.

It means that when you get the array you don’t have to estimate how big it is going to become and buy all that capacity up front. For example, you might need 100TB now, and 120TB next year, 150TB the year after, 180TB after that and 200TB in year five – a total of 750TB.

Buy all that in year one and you have lots of capacity sitting idle. With scale-out you can buy chunks of storage better tuned to how much you need, when you need it.
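
A back-of-the-envelope sketch, using the example figures above, shows how much capacity would sit idle each year if all 750TB were bought in year one (the yearly numbers are the article's example; the code is purely illustrative):

```python
# Rough sketch of up-front vs as-needed buying, using the example figures
# above (capacity added each year, in TB). Illustrative only.
yearly_additions_tb = [100, 120, 150, 180, 200]   # years one to five
upfront_total_tb = sum(yearly_additions_tb)       # 750TB bought in year one

for year in range(1, len(yearly_additions_tb) + 1):
    in_use = sum(yearly_additions_tb[:year])      # capacity actually needed by this year
    idle = upfront_total_tb - in_use              # sits unused if it was all bought up front
    print(f"Year {year}: {in_use}TB in use, {idle}TB idle out of {upfront_total_tb}TB")
```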

Growing clusters

With a cluster, nodes can join when they are needed and the cluster discovers and integrates them. It makes sure the processing and storage burden is shared out, and ensures that the cluster continues to operate when nodes suffer partial or total failure.

Clusters can generally be formed from nodes of different kinds. Thus if new nodes have more advanced processors or different storage media they can still interoperate with older nodes. Adding nodes increases the cluster's overall processing and storage resources.

There are three high-level factors that contribute to cluster file data access delays: metadata navigation to locate a file; disk data access latency to access the file data; and inter-node jumps involved in accessing data and metadata.

NetApp's clustered Data ONTAP arrays can scale to 24 nodes in NAS-only form and eight nodes in SAN (block-access) form. The clusters are made up of up to a dozen high-availability pairs, with both members of each pair being identical and the nodes connected across 10GbE links.

The more scalable Isilon filers from EMC can grow to 144 nodes, a different ballpark. Its 354TB HD400 capacity nodes can be used to build a 50.9PB cluster with a 50PB-plus global namespace within which files are located and accessed.
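
For a quick sanity check on that figure, raw cluster capacity is simply the node count multiplied by the per-node capacity:

```python
# Raw capacity: node count x per-node capacity, using the figures quoted above.
nodes, tb_per_node = 144, 354
print(f"{nodes * tb_per_node}TB = {nodes * tb_per_node / 1000}PB")  # 50976TB = 50.976PB, the ~50.9PB quoted
```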

What are the differences between a sub-30-node cluster and a 100-plus-node scale-out cluster? Scale is an obvious answer, and that affects file access times. With a nested file:folder structure spread across many storage nodes, the problem is locating a file.

Suppose we have 20 folder levels and the file we want is at the bottom, accessed as folder0/folder1.n/folder2.n/folder3.n/....folder20.n.file_we_want.txt. The access sequence is to read each folder's index to locate the next level down, carry on to the lowest folder containing the file, and then locate the file and fetch it; that takes a nominal 21 disk accesses.
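
A minimal sketch of that walk, counting one metadata read per folder level (the path and the counting are purely illustrative, not any particular filer's code):

```python
# Minimal sketch: each folder level costs one metadata (directory index) read,
# plus a final access to locate and fetch the file itself. Illustrative only.

def count_accesses(path: str) -> int:
    parts = [p for p in path.strip("/").split("/") if p]
    folder_levels = parts[:-1]       # every component except the file itself
    return len(folder_levels) + 1    # one read per folder index, plus the file fetch

# 20 nested folder levels, as in the example above
path = "/".join(f"folder{i}" for i in range(1, 21)) + "/file_we_want.txt"
print(count_accesses(path))          # 21 nominal accesses
```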

It can get worse. The sub-folders and destination file can be on different systems, involving network hops across the cluster. The more nodes and the larger the namespace then the worse the file access latency problem can be. At the top level we have folder navigation disk access latency and inter-node jump latency.

Data access acceleration

In an Isilon cluster there can be a mix of node types and InfiniBand is used as the cluster interconnect. Each node has full visibility and write/read access to or from the single expandable file system.

This is faster and has a lower latency than 10GbE. A 144-port Mellanox switch is qualified by EMC for the interconnect and a proprietary unicast node-to-node protocol is used.

Nodes might use dual data rate InfiniBand running at 20Gbps and have dual cards, meaning 40Gbps. Newer node types, such as A100, X410 and S210, can use dual quad data rate InfiniBand cards, meaning 2 x 40Gbps or 80Gbps.

Accessing clients come into the cluster over Ethernet, with all inter-node links being InfiniBand. The effect of inter-node jumps is reduced by using InfiniBand instead of Ethernet, and also because information is shared across this internal network and data can be read from or written to any node. That optimises performance when several users are accessing the same data.

The folder navigation latency problem is addressed by storing file metadata, such as folder navigation details, in fast-access flash memory. Where a random flash access can take 100 to 150 microseconds, a similar disk access can take 15 milliseconds, around 100 times longer.

Using disk for 21 accesses to get to a file could take 315 milliseconds, whereas using flash would need 3,150 microseconds, or put another way 3.15 milliseconds, 100 times faster.
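
Spelled out as a quick calculation (using the ballpark latencies above; real devices vary):

```python
# Ballpark arithmetic from the latency figures above.
accesses = 21
disk_access_s = 15e-3      # ~15 milliseconds per random disk access
flash_access_s = 150e-6    # ~150 microseconds per random flash access (upper end)

disk_total_ms = accesses * disk_access_s * 1000     # 315ms
flash_total_ms = accesses * flash_access_s * 1000   # 3.15ms
print(disk_total_ms, flash_total_ms, disk_total_ms / flash_total_ms)  # flash is roughly 100x faster
```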

With billions of file accesses per hour, this metadata access speed-up has a profound effect on overall cluster responsiveness. However, the time to access the data itself is still gated by disk speed.

Bigger and bigger

In the last few years, big-data applications using Hadoop file stores have proliferated and the idea of running iterative analysis of large filtered sets of data from massive unstructured and schema-less HDFS data stores is becoming popular.

Such stores break up the data and spread it across many compute and storage nodes, with a degree of parallel processing.

However, the analysis run time of such processes is gated by disk-access speed. Also, the sequence of operations is pipelined, with dataset filtering followed by successive analysis runs, so the cumulative delay from disk access latency mounts up.
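
A crude model of that pipelining effect, with invented stage counts and bandwidth figures rather than measured ones, looks like this:

```python
# Crude model: one filtering pass plus several analysis passes, each gated by
# how fast the working set can be read back from the store. All numbers are
# invented assumptions for illustration, not measurements.

working_set_tb = 50    # filtered dataset size
passes = 1 + 5         # one filter pass plus five iterative analysis passes

def pipeline_hours(read_gb_per_s: float) -> float:
    seconds_per_pass = working_set_tb * 1000 / read_gb_per_s
    return passes * seconds_per_pass / 3600

print(f"disk-backed store:  {pipeline_hours(2.0):.1f} hours")   # ~41.7 hours at 2GB/s aggregate
print(f"flash-backed store: {pipeline_hours(40.0):.1f} hours")  # ~2.1 hours at 40GB/s aggregate
```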

Typically, big-data volumes are huge, hundreds of petabytes for example, and commercial off-the-shelf hardware is used for Hadoop stores to keep costs in check. That means Ethernet communications links and no metadata caching in flash.

To accomplish big-data filtering and iterative analysis runs in real time, the lessons of 100-node-plus scale-out clustering need to be applied: metadata cached in flash, inter-node communications accelerated and, as the last step, all hot data stored in flash.

How might this be achieved?

No current supplier offers a suitable product. SanDisk is developing the concept of a cold archive with near-instant access to data, using high-capacity 3D TLC (3 bits per cell) NAND, which could triple the capacity of a current MLC (2 bits per cell) planar (single layer of flash cells) SSD.

There are multiple layers of cells in a 3D NAND die and SSD capacity could quadruple; it depends on the supplier's balance of sales, costs and profitability.

For now SanDisk has launched its InfiniFlash array for big data, storing 512TB in 3U at less than $2 per raw GB. The vendor says TLC flash is on its roadmap for the product.

Startup Skyera was developing high-density flash packaging with its skyHawk (136TB in 1U) and skyEagle (500TB in 1U) arrays. However, the firm had difficulty productising its technology and was snapped up by WD in December.

All-flash scale-out cluster

EMC is developing hugely scalable all-flash array technology after buying a startup called DSSD last year. That technology appears to meet the need for an all-flash, big-data scale-out cluster. It offers:

  • Scale-out node building block in the form of 5U enclosure containing 36 hot-swap flash modules
  • Each has 512 flash dice and card controllers running EMC firmware
  • Hundreds if not thousands of nodes
  • Block access and HDFS support
  • Data stored as objects in a flat file space with no file:folder superstructure (see the sketch after this list)
  • Pooled server memory flash
  • NVMe inter-node access using a PCIe node interconnect fabric faster than InfiniBand
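
As a toy illustration of that flat object space, here is the difference between walking a nested folder tree and doing a single keyed lookup; the names and structures are invented, not DSSD's actual internals:

```python
# Toy contrast: nested file:folder navigation vs a flat object space.
# Everything here is invented for illustration, not DSSD's real layout.

nested = {"folder1": {"folder2": {"folder3": {"file_we_want.txt": b"data"}}}}
flat = {"object-7f3a": b"data"}    # object ID -> data, no folder superstructure

def nested_lookup(tree, path):
    node = tree
    for component in path:         # one metadata hop per folder level
        node = node[component]
    return node

print(nested_lookup(nested, ["folder1", "folder2", "folder3", "file_we_want.txt"]))  # four hops
print(flat["object-7f3a"])         # one keyed lookup
```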

The DSSD product is being tested by early customers and is expected to be announced later this year. EMC's marketing is limited to carefully released glamorous tidbits of information and claims of face-melting performance.

We don't know how much faster a DSSD node would be than an Isilon node with a flash cache, but the relative increase in speed could be in the tens (almost guaranteed), perhaps the hundreds (not unlikely), and perhaps even higher (not beyond the bounds of possibility).

We also don't know how much a 100-node DSSD array would cost. Customers who need humungously fast big-data processing speed, or traditional OLTP speed in financial trading and similar scenarios, may well be willing to pay a double, triple or higher premium over the Isilon node cost.

This is all speculation because no generally available product exists yet and we are living on scale-out flash array hype.

However, EMC, SanDisk and WD/HGST are all serious companies and the need for big-data analysis speed and capacity heading towards exabyte levels is real.

El Reg has no fears in predicting that scale-out all-flash big-data analysis arrays will be generally available in 2016/17 and will transform analysis speeds. ®
