ParAccel flashes data warehouses
Thinking in columns
ParAccel - one of the many upstarts that are chasing the data warehousing and analytics dollars these days - has tweaked its ParAccel Analytic Database 2.0 software and its underlying homegrown Linux operating system so that the x64 nodes on which it runs can be equipped with flash-based drives. And that, the company says, will boost query performance.
The ParAccel analytics database and the data warehousing clusters that are built using it are not just glorified relational databases that organize record information in rows and then scan it to do queries, but rather use a columnar format that organizes data by field. (The Sybase IQ database, which is used in several thousand data warehouses and which is distinct from the regular Adaptive Server relational database, also organizes data in columns).
Organizing relational databases by row is key for transaction processing, where you want to locate a record among zillions, read its data, and maybe modify it. But in a data warehouse, where you want to sort data and extract answers from the tables underlying the database, this row orientation gets in the way and slows everything down. Which is why Barry Zane, one of the founders of data warehousing appliance maker Netezza, left the company and started ParAccel in 2006.
Here's a simple example: Say you have a subset of the US census data in which each record holds answers to 10 questions, stored in 10 fields in a relational database. Say each field has 10 bytes of data. In the row-oriented relational data warehouse, if you want to ask a question about the state and age of citizens, you have to scan all ten fields of every record. For ten records, that is a total of 1,000 bytes.
But in a columnar database, you know you only want to look at the age and state columns, and you are only scanning 200 bytes to do a query. With the ParAccel database, you run the database in a shared-nothing, massively parallel cluster of servers with a mix of local server and remote SAN storage, and you can radically speed up table scans and queries as well as loading of data onto the database because everything is parallelized.
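To make the arithmetic in the census example concrete, here is a minimal sketch of the bytes each layout has to scan. The record count of ten is an assumption, picked because it is the number that reproduces the 1,000-byte and 200-byte figures; the function names are hypothetical and have nothing to do with ParAccel's software.

```python
# Figures from the census example. RECORDS = 10 is an assumption chosen so
# that the totals match the 1,000-byte and 200-byte figures in the text.
FIELDS = 10
BYTES_PER_FIELD = 10
RECORDS = 10


def row_store_bytes_scanned(records, fields, bytes_per_field):
    """A row store reads every field of every record it scans."""
    return records * fields * bytes_per_field


def column_store_bytes_scanned(records, columns_needed, bytes_per_field):
    """A column store reads only the columns the query touches."""
    return records * columns_needed * bytes_per_field


# The query looks at two columns: state and age.
row_bytes = row_store_bytes_scanned(RECORDS, FIELDS, BYTES_PER_FIELD)
col_bytes = column_store_bytes_scanned(RECORDS, 2, BYTES_PER_FIELD)

print(row_bytes, col_bytes, row_bytes / col_bytes)
```

Under these assumptions the columnar scan touches one fifth of the data, and the gap widens as tables get wider while queries keep touching only a few columns.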
The addition of flash to PADB 2.0, which started shipping in June, doesn't boost performance as much as you might expect, and that's because of the clever things that the database already does with local and remote storage to goose performance. According to Kim Stanick, vice president of marketing at ParAccel, customers should expect about a 15 per cent performance boost if they add some flash drives to their x64 servers, and when the reduced power consumption is taken into account, they might see a 25 per cent increase in queries per watt. That's nothing to sneeze at, but it is not the kind of performance improvement you would expect given the very high I/O rates of flash drives.
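The two percentages quoted by Stanick imply a rough figure for the power savings. The sketch below assumes that queries per watt is simply query throughput divided by power draw, measured on the same workload; that definition and the function name are assumptions, not something ParAccel has spelled out.

```python
def implied_power_ratio(perf_gain, queries_per_watt_gain):
    """Power draw relative to the baseline, assuming
    queries per watt = throughput / power."""
    return perf_gain / queries_per_watt_gain


# 15 per cent more performance, 25 per cent more queries per watt.
ratio = implied_power_ratio(1.15, 1.25)
print(f"power draw relative to baseline: {ratio:.2f}")
```

On that reading, the flash-equipped servers would be drawing about 8 per cent less power than the disk-only ones.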
ParAccel got its start as an appliance maker front-ending Microsoft's SQL Server database to speed up queries and has gradually transformed itself into a seller of a free-standing database for analytics and data warehousing. The software runs on a stripped-down version of Red Hat Linux, from which ParAccel has cut all the fat, leaving just the features needed to run the database.
The software is supported on just about any x64 server, and starting at the end of 2008, the company tapped EMC's Clariion CX4 arrays as the preferred SAN storage for companies that wanted to use a mix of local and SAN storage for their data warehouses. This bundle was called the Scalable Analytic Appliance.
In May of this year, the SAA II appliance was announced using the Clariion CX4 arrays. The base configuration of this setup comes with eight x64 server nodes and a CX4 model 240 array. (This is a soft bundle, meaning you have to buy the parts yourself, but they are certified to work together). With today's announcement, customers can plug in any flash-based storage device that goes directly into the PCI-Express bus of the server, which is what the database and the operating system can see. You can't use a disk controller with lots of flash drives hanging off it since the database doesn't know how to talk to the controllers; it wants to talk directly to flash.
In the sample rack configuration of the SAA II appliance, ParAccel has eight two-socket Dell 2U PowerEdge servers as compute nodes; each has 24 small form factor 500 GB SAS disks and either four Gigabit Ethernet or two 10 Gigabit Ethernet ports. The rack includes one leader server node for managing the database cluster nodes in the rack and a hot standby server. The rack has a CX4-240 or CX4-480 array, which can house up to 60 2 TB disks. With compression on the data, this setup has an effective capacity north of 500 TB.
This setup can deliver database scans on the order of 2,400 MB/sec per server, according to Stanick. A flash configuration that uses eighteen dual-socket Xeon 5500 1U PowerEdge servers, each with three SAS drives and two Fusion-io 640 GB flash drives, plus the hot spares and the CX4 SAN, also has an effective capacity of 500 TB (compressed). In this flash setup, a single server node can deliver 2,800 MB/sec in database scans, and the configuration takes up a smaller footprint and uses a lot less energy, too. By switching to flash for some of the local storage, you can get 2.6 times as much oomph chewing on that 500 TB of data (eighteen nodes at 2,800 MB/sec against eight nodes at 2,400 MB/sec).
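The 2.6 times figure appears to fall out of the node counts and per-node scan rates quoted above. Here is a minimal sketch of that arithmetic, assuming aggregate scan rate grows linearly with the number of nodes, as you would hope from a shared-nothing cluster; the variable names are made up for illustration.

```python
# Per-node scan rates in MB/sec and node counts from the two configurations.
disk_nodes, disk_rate = 8, 2400     # 2U servers with 24 SAS disks each
flash_nodes, flash_rate = 18, 2800  # 1U servers with Fusion-io flash

# Assumption: aggregate scan rate is nodes times per-node rate.
disk_total = disk_nodes * disk_rate
flash_total = flash_nodes * flash_rate

per_node_gain = flash_rate / disk_rate
cluster_gain = flash_total / disk_total

print(disk_total, flash_total)
print(f"per node: {per_node_gain:.2f}x, whole cluster: {cluster_gain:.2f}x")
```

The per-node gain works out to roughly 17 per cent, in the same ballpark as the 15 per cent boost ParAccel quotes, so most of the cluster-level speedup comes from packing more, smaller nodes around the same 500 TB.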
In June, with the launch of PADB 2.0, a feature called blended scan started shipping, and it is one of the reasons why adding lots of flash doesn't boost performance much on an individual server: the feature is already boosting performance. Here's how it works. In a typical server node in a data warehouse cluster, each disk is mirrored (RAID 1) so a disk failure doesn't result in the loss of data. So if you have a typical four-node database cluster, with each node having eight drives, only half of them are doing useful work, yielding a scan rate of about 800 MB/sec.
If you hook the four nodes up to a SAN that has 56 mirrored disks, you might see a scan rate of 1,200 MB/sec. With blended scanning, which ParAccel is trying to get a patent on, you designate the disks out on the SAN as being the authoritative copy of the data and you mirror there, and then use the local disks on the server nodes as a cache for data. The scans run across a mix of the local and SAN disks, yielding a scan rate of 2,800 MB/sec (twice the 800 MB/sec rate of the four nodes, because all of their disks are now doing useful work, plus the 1,200 MB/sec of SAN bandwidth).
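The blended scan numbers add up as follows. This is only a back-of-the-envelope model, assuming that dropping the local mirrors doubles the local scan rate and that local and SAN bandwidth simply add; the function name is hypothetical and the real feature is surely more subtle.

```python
def blended_scan_rate(local_mirrored_rate, san_rate):
    """Estimated scan rate in MB/sec when the SAN holds the mirrored copy
    and every local disk is freed up to serve scans as a cache."""
    # With RAID 1, only half the local disks do useful scan work, so
    # un-mirroring them is assumed to double the local rate.
    local_unmirrored_rate = 2 * local_mirrored_rate
    return local_unmirrored_rate + san_rate


# Four nodes with mirrored local disks: about 800 MB/sec.
# SAN with 56 mirrored disks: about 1,200 MB/sec.
rate = blended_scan_rate(800, 1200)
print(rate)
```

That lands on the 2,800 MB/sec figure, which is why the local disks were already pulling their weight before any flash showed up.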
The PADB 2.0 analytics database has a list price of $100,000 per TB, but discounts are available for volume purchases. ParAccel, which has several dozen paying customers (including PriceChopper, OfficeMax, Merkle, and Autometrics), also sells the software under a subscription model for $5,000 per TB per month. ®