Quantcast gives the boot to Hadoop's HDFS

Boots up – and opens up – homegrown QFS alternative

  • alert
  • submit to reddit

Top three mobile application threats

There's no shortage of complaining about the limitations of the Hadoop Distributed File System that underpins the MapReduce algorithms of the Hadoop Big Data muncher, which is why quite a few companies have come up with alternatives. And now there is a new HDFS alternative from internet click counter Quantcast called – you guessed it – the Quantcast File System.

Quantcast was founded in 2006 in San Francisco, and is hired to do sampling of data by either web content publishers or the advertising agencies that are paying the bills. In either case, Quantcast founder and CEO Konrad Feldman tells El Reg that over 100 million "web destinations" are being tracked by Quantcast's giant Hadoop clusters. And as you might imagine, this generates quite a bit of data, which the company then dices and slices to provide demographics for websites and advertisers. (Who, once again, pay most of the bills around here on the Intertubes, whether we like it or not.) To get Quantcast going, a whole team of backers – including Revolution Ventures, Founders Fund, Polaris Venture Partners, Cisco Systems, and Glynn Capital Management – have paid most of the start-up's bills through three rounds of venture funding totaling $53.2m.

Quantcast was an early practitioner of Hadoopery, and it cranked up its first clusters in 2006 ahead of the launch of its Quantcast Measurement system for websites. Just six months later, it was receiving 1TB of data a day from the tracking tags that it has spread around the internet through content suppliers.

About a year later the company started running up against the limits of HDFS, following in the footsteps of Yahoo!, Facebook, and other big properties on the web before it, says Jim Kelly, vice president of research and development at Quantcast. In early 2008, the company started using the Kosmos File System (KFS) developed by Sriram Rao and eventually hired him to continue development of the KFS alternative to HDFS. (Rao now works as a research scientist at Microsoft.) KFS was used in some secondary systems in 2008 to boost performance, but eventually Quantcast decided to do a big development effort and significantly revamp KFS so it would scale a lot further and deliver more performance.

Timeline of Hadoop usage at Quantcast

Timeline of Hadoop usage at Quantcast (click to enlarge)

The company really had no choice. In early 2009, when it launched Quantcast Advertising, embedding its tags in ads as well as content, it was receiving 10TB of new data per day and processing over 1PB of data a day, dicing and slicing, mapping and reducing. And as the business grew, so did the data load. By early 2011, Quantcast was receiving 20TB a day and processing 10PB per day. Late last year, the QFS kicker to KFS was more or less done and ready for primetime, and the company spent weeks moving its production Hadoop clusters off HDFS and onto QFS. And as far as Feldman is concerned, it is a good thing the company did, because its data load has doubled again to 40TB inbound each day and its Hadoop clusters have to chew through 20PB a day for clients who want to know what you are doing out there on the intertubes.

Instead of keeping QFS all to itself, which it could have done because it is providing a service, not marketing it as a product, Quantcast is opening up QFS with its 1.0 production release and distributing it at GitHub under an Apache V2.0 license (which is the same license that KFS was under). "As we have grown our business, we have used a significant amount of open source software, and our developers want to give something back," says Kelly.

And neither Feldman nor Kelly has any intention of providing commercial support for Quantcast, but they are cool with someone else doing so. What they want is a community of developers who are hitting the ceiling with HDFS and who will help solve the petascale problems that it has.

Like HDFS, QFS is a distributed file system, but instead of implementing it in the Java language as is done with HDFS, Quantcast compiled it down much closer to the iron in a good old C++ compiler, which boosts its performance significantly. The file system is 100 per cent compatible with Hadoop, which doesn't know or care that it isn't using HDFS. The big difference is that the architecture and error recovery mechanisms for QFS are perhaps better suited to modern clusters than HDFS.

"When Hadoop was built, networks and disks were vying for the slowest things in the data center," says Kelly. "But now, networks are 10 times faster and disks are as slow as ever. So it makes sense to do a different optimization for a file system for Hadoop."

With Hadoop, because networks and disks were slow, the MapReduce algorithm dispatches work to a node where the NameNode in the Hadoop cluster knows a particular set of data exists. For throughput and data availability reasons, Hadoop keeps three copies of all data sets and spreads them out across a cluster. The assumption is that networks are slow, so chunk up the data and the chewing so you don't have to pass data over the network to processors.

With QFS, rather than making three copies of the data sets and spreading them around the cluster, the data is encoded using the Reed-Solomon algorithm, which is used to encode CDs and DVDs, among other things. This algorithm breaks data into 64KB stripes and then it generates three different parity stripes for every six data stripes. Each one of the nine resulting stripes is put on a different physical drive by the file system, and ideally every stripe is in a different rack (not disk or server node, but rack) in the cluster. I know that sounds insane, because you are going from local access of data over a PCI bus to a compute job – where by definition all of the data has to come in across the network. But every write is parallelized across nine disks and every read is parallelized across six drives, and when you move from Gigabit Ethernet up to 10GE or 40GE links or QDR or FDR InfiniBand links, the net effect is that QFS is faster than HDFS.

How much you ask? Well, take a gander:

Relative performance of HDFS and QFS

Relative performance of HDFS and QFS

The benchmark tests shown above were run on a production system with real Quantcast workloads on a cluster with over 1,000 nodes and a total of 6,500 disk drives. The metaserver (akin to the NameNode in a Hadoop HDFS setup) can build, inspect, or deconstruct a balanced tree with up to 75.7 million directories. Listing is not much faster, but both the stat command, which looks up the file size and timestamp on a chunk of data, and the making directories are clearly running a lot faster in QFS than in HDFS. (This test was done on the Cloudera CH4 distro of Hadoop.)

And on a 20TB Hadoop job, done on the open source Apache Hadoop 1.0, QFS can read data about 47 per cent faster and write it about 75 per cent faster. This makes sense when you consider that for 20TB of data, you have to write it three times in HDFS, but only by a factor of 1.5 through the Reed-Solomon algorithm. And when you do a 20TB read operation, you have to do the read on the whole data set, but the parallelism in having the read parallelized across six drives more than makes up for the network latencies.

Obviously, QFS is not for everybody. If you have a small Hadoop cluster and you are taking in dozens of gigabytes of data each day and chewing on dozens of terabytes each day on dozens of server nodes, then the open source Apache Hadoop or the commercial-grade distros from Cloudera, HortonWorks, MapR, IBM, or Greenplum/EMC will probably do you just fine.

Other potential QFS users are those who won't mind spending on networking hardware in exchange for not having to buy so many disk drives. Storage is a huge component of the cost of a Hadoop cluster, and being able to cut the drive count in half is a big deal on big clusters. It could even be as big a deal as getting better performance, in fact. ®

High performance access to file storage

More from The Register

next story
This time it's 'Personal': new Office 365 sub covers just two devices
Redmond also brings Office into Google's back yard
Kingston DataTraveler MicroDuo: Turn your phone into a 72GB beast
USB-usiness in the front, micro-USB party in the back
Dropbox defends fantastically badly timed Condoleezza Rice appointment
'Nothing is going to change with Dr. Rice's appointment,' file sharer promises
Inside the Hekaton: SQL Server 2014's database engine deconstructed
Nadella's database sqares the circle of cheap memory vs speed
BOFH: Oh DO tell us what you think. *CLICK*
$%%&amp Oh dear, we've been cut *CLICK* Well hello *CLICK* You're breaking up...
Just what could be inside Dropbox's new 'Home For Life'?
Biz apps, messaging, photos, email, more storage – sorry, did you think there would be cake?
AMD's 'Seattle' 64-bit ARM server chips now sampling, set to launch in late 2014
But they won't appear in SeaMicro Fabric Compute Systems anytime soon
Amazon reveals its Google-killing 'R3' server instances
A mega-memory instance that never forgets
prev story


Top three mobile application threats
Learn about three of the top mobile application security threats facing businesses today and recommendations on how to mitigate the risk.
Combat fraud and increase customer satisfaction
Based on their experience using HP ArcSight Enterprise Security Manager for IT security operations, Finansbank moved to HP ArcSight ESM for fraud management.
The benefits of software based PBX
Why you should break free from your proprietary PBX and how to leverage your existing server hardware.
Five 3D headsets to be won!
We were so impressed by the Durovis Dive headset we’ve asked the company to give some away to Reg readers.
SANS - Survey on application security programs
In this whitepaper learn about the state of application security programs and practices of 488 surveyed respondents, and discover how mature and effective these programs are.