
NetApp’s effort to feed big data beast through NFS makes no sense

Latency on one side, huge capacities on the other

Opinion I’ve just finished reading a few articles about NetApp on The Register, as well as Val Bercovici’s post on NetApp’s blog, and I have a question. NFS for Hadoop? Really?

In this correspondent's opinion, it simply makes no sense at all.

Yes, you could find corner cases – you can always find a corner case for something you love – but in this case, that’s about all you can find.

And I’m not talking about data ingestion here.

Storing (BIG) data on primary storage

One of the benefits of HDFS is that it is a distributed filesystem: it has all the embedded availability, replication and protection mechanisms you need to store huge amounts of data safely and, above all, it’s very inexpensive.

In fact, you can build your HDFS-based storage layer by simply adding disks into cluster nodes, and all the management tools are integrated. At the end of the day, it’s just a file system that you get for free with any Hadoop distribution.

Despite all its defects, HDFS is optimised for this job: it’s “local” to the cluster, it is designed to move big chunks of data, and it doesn’t need the special attention usually required by primary storage. The total cost of acquisition and total cost of ownership of HDFS is very low.
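To make the point concrete, here is a rough sketch (in Java, against Hadoop’s standard FileSystem API; the NameNode address and file path are hypothetical) of what “embedded protection” means in practice: replication is a per-file property of HDFS itself, not of some array sitting underneath it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/datalake/events.log"));
            // Replication factor and block size are attributes of the file in HDFS,
            // handled by the cluster itself with no external storage layer involved.
            System.out.println("replication = " + status.getReplication());
            System.out.println("block size  = " + status.getBlockSize());
        }
    }
}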

Primary storage can easily be positioned on the opposite side:

  • It definitely has problems managing big data analytics and traditional enterprise workloads at the same time (especially if they need to leverage the same resources; quality of service is still an option for most storage vendors).
  • It also introduces huge management costs when it comes to backup and remote replication – costs that become unsustainable if your environment scales beyond a few hundred terabytes.

Val Bercovici, in his article, talks about a hypothetical use case with HDFS in the role of a cache (or a primary file system) and NetApp as a secondary repository.

This way HDFS comes out on top compared with what is usually sold as primary storage – and why would you use primary storage for a secondary storage task?

Don’t get me wrong. I totally agree with the caching layer part – I’ve been talking about it for months – but I think secondary storage has to be the slowest, most automated, most scalable and cheapest part of this kind of design. And this is where NetApp doesn’t really fit in, does it?
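A sketch of the tiering idea, to be clear about what I mean (again Java, again with made-up paths: the cold tier here is simply whatever cheap share happens to be mounted on an edge node – an NFS export, an object store gateway, anything slow, automated and inexpensive):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class TierOut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hot tier: the cluster's own distributed filesystem.
        FileSystem hot = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), conf);

        // Cold tier: a hypothetical NFS export mounted at /mnt/secondary,
        // reached through the local filesystem.
        FileSystem cold = FileSystem.get(new java.net.URI("file:///"), conf);

        // Copy without deleting the source: HDFS keeps feeding the analytics jobs,
        // the secondary tier just holds the protection copy.
        FileUtil.copy(hot, new Path("/datalake/2014/11/events"),
                      cold, new Path("/mnt/secondary/2014/11/events"),
                      false, conf);
    }
}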

Analysing (big) data in place

Analysing (big) data is something I really like, but doing that on NFS and NetApp FAS is just too costly.

There are many limits and constraints that mean NFS and NetApp FAS are not the ideal solution – not to mention the higher cost of NetApp FAS compared with better-suited alternatives for this particular use case – at least in my view.

In fact, if you look at what is happening all around, enterprises are piling up data. Like it or not, they are starting to build data lakes. The limits of the ONTAP file system (WAFL) and of its data volumes, in terms of number of objects and capacity, are just the first obstacles (as I recall, the maximum size of a single volume is still around 100TB).

Yes, you can configure a NetApp system for high capacity (and with large volumes) but then you might not get the performance – and you won’t have any of the advantages usually found in object-based systems.

Various object storage vendors are working on similar capabilities, proposing an HDFS interface on top of their platforms. Working with the same filesystem interface (internally and externally to the cluster) is much better at every level. And, going back to the first use case presented in Val’s blog, it also enables seamless use of the object storage system for secondary copies of data.
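This is what “the same filesystem interface” buys you in practice – a sketch assuming Hadoop’s S3A connector is on the classpath, with an invented bucket name: the very same FileSystem API that reads from HDFS inside the cluster can list a bucket on an S3-compatible object store outside it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SameInterface {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // credentials for the object store go here

        // Inside the cluster: HDFS.
        FileSystem hdfs = FileSystem.get(new java.net.URI("hdfs://namenode:8020"), conf);

        // Outside the cluster: an object store exposed through the same API.
        FileSystem objectStore = FileSystem.get(new java.net.URI("s3a://secondary-copies"), conf);

        for (FileStatus f : objectStore.listStatus(new Path("/2014/11/"))) {
            System.out.println(f.getPath() + " " + f.getLen());
        }

        // Jobs, distcp runs and management scripts don't care which one they talk to –
        // which is what makes seamless secondary copies possible.
        hdfs.close();
        objectStore.close();
    }
}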
