Original URL: http://www.theregister.co.uk/2011/01/31/alacritech_nfs_read_anomaly/

Alacritech apprehends an NFS anomaly

Wildly unbalanced filer I/O

By Chris Mellor

Posted in Storage, 31st January 2011 13:00 GMT

Comment Alacritech claims NFS filer I/O is grossly skewed towards reads and suffers from read metadata processing that chokes controller CPUs.

It has just launched its ANX 1500 filer accelerating cache product based on its recognition of NFS read metadata filer I/O loads that can overwhelm filer processors and delay file delivery.

A couple of years ago Alacritech had a 10gig Ethernet adapter nearing readiness but found that the market had moved on, wanting converged network adapters (CNA) which could do FCoE and iWARP, as well as iSCSI and TCP/IP Offload and basic Ethernet NIC'ing. It would have needed to write its own code or license IP from Emulex or QLogic and decided, according to marketing VP Doug Rainbolt, that it was "not worth it". (Ironically, Emulex licenses Alacritech IP for its CNA.)

Alacritech decided to turn aside from the adapter business and, reflecting its founders' Auspex roots, look at accelerating network-attached storage (NAS) file access. Most filer shops use NFS v3. Close inspection of NFS v3 filer I/O patterns showed wild read and write asymmetry. One Fortune 500 company exhibited this pattern:

From the point of view of the filer's controller, half of its life was spent getting data off the disk drives and out to accessing host servers and the other half checking the metadata associated with read requests. Write I/O activity was basically inconsequential, particularly from the disk I/O point of view, as writes would be cached in the controller's NVRAM and re-ordered to provide near-sequential I/O. Also, for NetApp users, Rainbolt said WAFL is good for writes.

Reads cannot be re-ordered because they have to be answered as and when they come in and are randomly located on the filer's drive platters. The typical answer to this is to use high-speed drives and, if necessary, short-stroke them to minimise head movements (seek time). Both are expensive to do.

But what Alacritech realised was that the randomness of read I/O wasn't the only problem – read metadata was just as big a problem, turning a filer's processors into access bottlenecks if enough metadata checking was needed. Rainbolt said: "The controller is becoming a bottleneck before the disk drives do. The processor can't keep up ... Metadata consumes the CPU like you wouldn't believe."

If you could remove the metadata checking from the filer's CPUs and carry it out some place else, then the filer could get on with its core job of answering read requests and serving files as fast as it is capable of doing.

Alacritech and Isilon

Rainbolt said Isilon's scale-out clustered filers are affected by the same problem even though they serve lots of large files, meaning more sequential than random reads. Accessing clients store lots of Isilon-originated data in their caches and check whether their cache contents are up to date before hitting the Isilon filers with read requests, meaning the Isilon processors can also get hit with metadata requests. Isilon-type systems also struggle when faced with lots of small file requests.

Rainbolt said an example 9-node Isilon system was running 500,000 NFS metadata operations per second. Placing an Alacritech ANX 1500 front-end metadata offload engine in front of it bumped the number up to 2.6 to 3 million NFS metadata ops/sec and the Isilon served more files.
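For scale, the improvement Rainbolt quotes can be worked out directly. The short sketch below uses only the three figures he gave; the variable and function names are illustrative.

```python
# Figures quoted by Rainbolt for the example 9-node Isilon system.
baseline_ops = 500_000        # NFS metadata ops/sec without the ANX 1500
accelerated_low = 2_600_000   # lower quoted figure with the ANX 1500 in front
accelerated_high = 3_000_000  # upper quoted figure with the ANX 1500 in front


def speedup(after: int, before: int) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


low = speedup(accelerated_low, baseline_ops)
high = speedup(accelerated_high, baseline_ops)
print(f"{low:.1f}x to {high:.1f}x")  # prints "5.2x to 6.0x"
```

That is roughly a five- to six-fold increase in metadata operations, on Alacritech's own numbers.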

In other words, Alacritech contends, there is generic filer processor bottlenecking going on, slowing down filer responsiveness to read requests, due to the metadata processing consequent on NFS v3 read requests.

Isilon has added flash to speed up metadata operations.

Alacritech saw an opportunity to cache filer metadata in a front-end device, its ANX 1500 – an NFS metadata offload engine in effect – and remove that burden from the filer. That means filers can stop using lots of expensive short-stroked 15K rpm drives and revert to using fewer slower and cheaper middle of the road drives.
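The general idea of a front-end metadata cache can be illustrated with a toy model. This is only a sketch of the concept, not Alacritech's implementation: the class names, the attribute dictionary and the file path are all hypothetical, and the model ignores cache invalidation, NFS protocol details and writes entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFiler:
    """Hypothetical stand-in for a back-end filer; counts the lookups it serves."""
    attrs: dict
    lookups_served: int = 0

    def get_attributes(self, path: str) -> dict:
        self.lookups_served += 1
        return self.attrs[path]


@dataclass
class ToyMetadataCache:
    """Hypothetical front end that answers repeat lookups without touching the filer."""
    filer: ToyFiler
    cache: dict = field(default_factory=dict)
    hits: int = 0

    def get_attributes(self, path: str) -> dict:
        if path in self.cache:
            self.hits += 1
            return self.cache[path]
        # Cache miss: forward to the filer once and remember the answer.
        result = self.filer.get_attributes(path)
        self.cache[path] = result
        return result


# Arbitrary example attributes for a made-up file.
filer = ToyFiler(attrs={"/proj/a.dat": {"size": 4096, "mtime": 1296478800}})
front = ToyMetadataCache(filer)
for _ in range(1000):
    front.get_attributes("/proj/a.dat")

print(filer.lookups_served, front.hits)  # prints "1 999"
```

In this toy case the filer's processor handles one metadata lookup instead of a thousand. A real device would also have to keep cached attributes consistent with the filer, which the sketch does not attempt.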

Alacritech co-founder Peter Craft said: "We created an appliance to do metadata caching and use SSD (Solid State Drives). It involves our NFS Bridge technology and uses the ASICs from our 10gig Ethernet adapter work. It is very efficient and we have very low CPU utilisation on our box."

The ANX 1500 uses these ASICs with micro-code and has a "very thin, high-performance operating system."

Alacritech and NetApp

Craft said that other people saw there was a file access speed problem and recognised flash as a potential solution, and he mentioned NetApp's PAM (Performance Acceleration Module, now called Flash Cache). This is a slug of flash in NetApp's FAS controllers which functions as a read cache. He said: "In SPEC results PAM systems use fewer disk drives but the top end result is the same because they are CPU-bound. Even Avere can only do 22,000 ops. We can scale to hundreds of thousands of (SPEC NFS) ops."

He is saying that NetApp filers are limited in NFS ops scalability because they become limited by CPU processing bandwidth and not disk bandwidth. Cache resolves disk bandwidth problems but sits downstream of the CPUs and doesn't fix CPU issues.

Alacritech and Avere

Avere's clustered FXT accelerator nodes are an obvious competing technology to Alacritech's and are also based on deep analysis of filer I/O patterns. Alacritech says its technology is better – it would, wouldn't it – and seizes upon latency as one point of difference.

Craft said: "This is a key differentiator for us. Our latency is in the 0.2 millisecond range. The best SPEC organisation result is 0.5 milliseconds, with Avere. We're less than half the latency of the competition."

Why is this important? "Latency, aggregated across millions of I/O ops, translates into time."
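A back-of-the-envelope sketch shows what that aggregation looks like. It assumes the operations are issued strictly one after another, so that per-operation latency simply adds up; the one-million operation count is a hypothetical workload, while the two latencies are the figures Craft quoted.

```python
def total_wait_seconds(ops: int, latency_ms: float) -> float:
    """Total time spent waiting if `ops` requests are issued back to back."""
    return ops * latency_ms / 1000.0


ops = 1_000_000  # hypothetical number of serial I/O operations

anx = total_wait_seconds(ops, 0.2)        # Alacritech's claimed latency
best_spec = total_wait_seconds(ops, 0.5)  # best SPEC latency cited by Craft

print(f"{anx:.0f} s vs {best_spec:.0f} s")  # prints "200 s vs 500 s"
```

Real workloads overlap many requests, so the gap in wall-clock time would usually be smaller than this serial worst case suggests.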

Alacritech also says that Avere's technology, unlike its own, doesn't fit seamlessly into existing NAS infrastructures. The company says that Avere's tech:

This idea that a front-end cache hijacks and steals data from the back-end filer is certainly a colourful one. It is almost being hinted, I think, unless wishful thinking is happening, that the Alacritech and Avere approaches might even be complementary. One wonders what a combined device might look like and what effect that might have on NAS filers. Perhaps it is a stupid idea, like trying to combine a car engine super-charger and turbo-charger in a single super-duper-turbo-charger.

Craft said: "Avere has its strengths. We're not bashing them." But a single Avere FXT device tops out at around 20,000 NFS ops. If you put one in front of a NetApp FAS 3160, which can do about 60,000 NFS ops, you would slow it down, and: "you have to have clustered FXTs to solve that bottleneck ... We are a high-performance caching tier that doesn't do what Avere does."
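Craft's bottleneck argument can be put into numbers with a simple model. The sketch assumes the worst case in which every request must pass through the front-end device and on to the filer, and it assumes clustered nodes scale perfectly linearly; both are simplifying assumptions rather than claims made by either company. The function names are illustrative.

```python
import math


def delivered_ops(frontend_nodes: int, per_node_ops: int, filer_ops: int) -> int:
    """Throughput of the chain is capped by its slowest stage (no cache hits assumed)."""
    return min(frontend_nodes * per_node_ops, filer_ops)


def nodes_needed(filer_ops: int, per_node_ops: int) -> int:
    """Front-end nodes required so the front end is no slower than the filer."""
    return math.ceil(filer_ops / per_node_ops)


fxt_ops = 20_000  # approximate single-FXT figure quoted by Craft
fas_ops = 60_000  # approximate FAS 3160 figure quoted by Craft

print(delivered_ops(1, fxt_ops, fas_ops))  # prints 20000
print(nodes_needed(fas_ops, fxt_ops))      # prints 3
```

On these assumptions a single node would cap the filer at a third of its rated throughput, and three clustered nodes would be needed just to break even. Requests answered from the cache never reach the filer, so a real deployment with a good hit rate would do better than this worst case.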

Alacritech simply wrong

What does Avere say about all this? Rebecca Thompson, Avere's Marketing VP, took each Alacritech point above in turn and said:

• Ignores benefits and investment in existing NFS infrastructure - The Avere product line was designed to work in conjunction with our customers' existing NAS (both NFS as well as CIFS) infrastructure by offloading heavy performance loads, allowing customers to actually extend the lifecycle of their investment by not having to replace filers or add additional drives just for performance.

• Uses caching to hijack and steal data from back-end - Aside from the emotional language (NAS boxes do not have feelings) used, this claim is just simply wrong. That's like saying that the RAM in your PC is hijacking data from your SATA drive. Putting data on the best storage media to meet performance needs improves the overall productivity of an organization, which is why IT exists.

• Hides NAS management tools - False. I'm not even sure how one would be able to do this. We provide additional storage performance monitoring capabilities that customers love, but they still have complete access to whatever tools they use for storage server management.

• Relegates NAS back-end to mass storage without intelligence - False. In fact, we promote the fact that NAS storage servers have excellent data management tools and have engineered our nodes to work in conjunction with existing snapshot and backup schedules.

• Now owns mission critical data - Avere holds the active data set; however all read data still resides on the storage server as well. In the case of write data, it is completely up to the customer what schedule they set for write back (can be from seconds to hours to days) or they can choose to run Avere in write-through mode in which all write data is immediately written back to the filer.

• More media means more complexity - That's the beauty of automatic tiering - the fact that it's automatic reduces the complexity. The tiering algorithms in our software do the work so the storage administrator doesn't have to.

• Risks of data loss - Unlike Alacritech, Avere actually has had HA for its solution since it began shipping. Each node has an NVRAM card in case of node failure while holding dirty write data. In addition, each node has a peer within the cluster that it mirrors data to.

• More difficult to manage - Most of our customers claim the opposite, that because our performance monitoring tools give them such good insight into what's going on, it has made their job easier.

• Scalability is limited - One of the primary benefits of a clustered scale-out system is that it can scale by adding nodes. We currently support up to 25 nodes in a cluster, which would give us 90TB of capacity with a 2500 cluster and 13TB with the 2700. In addition, a single cluster can front end up to 24 filers, providing enormous flexibility as well.

• Cumbersome to configure - Just plain not true. We have a simple GUI configuration with only a few inputs required to get up and running. In addition, as new nodes are added, they auto-join the cluster. Avere would be happy to demo this.

Alacritech beta testing

Avere is into its second generation product while Alacritech is still in beta test. The Alacritech pitch is that lots of NetApp and Isilon shops, and EMC and BlueArc shops too, need acceleration – which Alacritech can best provide. The ANX 1500 is being tested in four large customer sites: "two large entertainment companies, a billion dollar EDA (Electronic Design Automation) company, and a multi-national, large high-tech component company in a virtualised environment."

Rainbolt says: "So far it has gone very well for us. There is a lot of interest."

Alacritech says there are three customer environments particularly relevant to its ANX technology: metadata-intensive environments, large sequential block environments, and virtual environments with apps and guest O/S deployed and housed via NFS.

Rainbolt's message for companies with NFS filers in these environments is to have a look at Alacritech's technology. What we want here in El Reg is to see SPEC NFS benchmark comparisons. We expect Alacritech ANX 1500 scores to be very interesting. ®