ParAccel flashes data warehouses
Thinking in columns
ParAccel got its start as an appliance maker front-ending Microsoft's SQL Server database to speed up queries and has gradually transformed itself into a seller of a free-standing database for analytics and data warehousing. The software runs on a stripped-down version of Red Hat Linux, from which ParAccel has cut all the fat, leaving just the features needed to run the database.
The software is supported on just about any x64 server, and starting at the end of 2008, the company tapped EMC's Clariion CX4 arrays as the preferred SAN storage for companies that wanted to use a mix of local and SAN storage for their data warehouses. This was called the Scalable Analytic Appliance.
In May of this year, the SAA II appliance was announced using the Clariion CX4 arrays. The base configuration of this setup comes with eight x64 server nodes and a CX4 model 240 array. (This is a soft bundle, meaning you have to buy the parts yourself, but they are certified to work together.) With today's announcement, customers can plug in any flash-based storage device that goes directly into the PCI-Express bus of the server, which is what the database and the operating system can see. You can't use a disk controller with lots of flash drives hanging off it, since the database doesn't know how to talk to the controllers; it wants to talk directly to flash.
In the sample rack configuration of the SAA II appliance, ParAccel has eight two-socket Dell 2U PowerEdge servers as compute nodes; each has 24 small form factor 500 GB SAS disks and either four Gigabit Ethernet or two 10 Gigabit Ethernet ports. The rack includes one leader server node for managing the database cluster nodes in the rack and a hot standby server. The rack has a CX4-240 or CX4-480 array, which can house up to 60 2 TB disks. With compression on the data, this setup has an effective capacity north of 500 TB.
This setup can deliver database scans on the order of 2,400 MB/sec per server, according to Stanick. A flash configuration that uses eighteen dual-socket Xeon 5500 1U PowerEdge servers, each with three SAS drives and two Fusion-io 640 GB flash drives, plus the hot spares and the CX4 SAN, also has an effective capacity of 500 TB (compressed). In this SAA II setup, a single server node can deliver 2,800 MB/sec in database scans, and the configuration takes up a smaller footprint and uses a lot less energy, too. By switching to flash for some of the local storage, you can get 2.6 times as much oomph chewing on that 500 TB of data.
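The 2.6x figure is not derived in the text, but it is consistent with comparing aggregate scan rates across all compute nodes in each rack. The sketch below works through that arithmetic, assuming scan throughput scales linearly with node count; that assumption, and the function name, are illustrative and not taken from ParAccel.

```python
def aggregate_scan_rate(nodes: int, mb_per_sec_per_node: float) -> float:
    """Total scan rate in MB/sec, assuming perfectly linear scaling across nodes."""
    return nodes * mb_per_sec_per_node

# Disk-based sample rack: eight nodes at roughly 2,400 MB/sec each.
disk_rack = aggregate_scan_rate(8, 2400)

# Flash-based rack: eighteen nodes at roughly 2,800 MB/sec each.
flash_rack = aggregate_scan_rate(18, 2800)

speedup = flash_rack / disk_rack
print(f"disk rack:  {disk_rack:,.0f} MB/sec")
print(f"flash rack: {flash_rack:,.0f} MB/sec")
print(f"speedup:    {speedup:.3f}x")
```

Under these assumptions the ratio works out to 2.625, which rounds to the quoted 2.6. Most of the gain comes from fitting more nodes into the rack rather than from the modest per-node improvement.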
In June, with the launch of PADB 2.0, a feature called blended scan started shipping, and it is one of the reasons why adding lots of flash doesn't boost performance much on an individual server: the feature is already boosting performance. Here's how it works. In a typical server node in a data warehouse cluster, each disk is mirrored (RAID 1) so a disk failure doesn't result in the loss of data. So if you have a typical four-node database cluster, with each node having eight drives, only half of the drives are doing useful work, yielding a scan rate of about 800 MB/sec.
If you hook the four nodes up to a SAN that has 56 mirrored disks, you might see a scan rate of 1,200 MB/sec. With blended scanning, which ParAccel is trying to get a patent on, you designate the disks out on the SAN as the authoritative copy of the data and do the mirroring there, and then use the local disks on the server nodes as a cache for data. The scans run across a mix of the local and SAN disks, yielding a scan rate of 2,800 MB/sec: twice the 800 MB/sec of the four nodes, because all of their disks are now doing useful work, plus the 1,200 MB/sec of SAN bandwidth.
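The blended scan arithmetic can be reproduced from the figures quoted here. This is a back-of-the-envelope sketch, not ParAccel's model: it assumes every local disk contributes the same scan rate, and that local and SAN bandwidth simply add. The variable names are made up for illustration.

```python
NODES = 4
DRIVES_PER_NODE = 8
MIRRORED_LOCAL_RATE = 800.0   # MB/sec with RAID 1, half the local drives scanning
SAN_RATE = 1200.0             # MB/sec from the SAN with 56 mirrored disks

total_local_drives = NODES * DRIVES_PER_NODE

# With local mirroring, only half of the local drives do useful scan work.
active_mirrored_drives = total_local_drives // 2
per_drive_rate = MIRRORED_LOCAL_RATE / active_mirrored_drives

# Blended scan: the SAN holds the mirrored copy, so every local drive
# can act as cache and contribute to the scan.
blended_local_rate = total_local_drives * per_drive_rate
blended_rate = blended_local_rate + SAN_RATE

print(f"per-drive rate:     {per_drive_rate:.0f} MB/sec")
print(f"local, unmirrored:  {blended_local_rate:.0f} MB/sec")
print(f"blended scan total: {blended_rate:.0f} MB/sec")
```

With these inputs the implied per-drive rate is 50 MB/sec, the unmirrored local disks deliver 1,600 MB/sec, and adding the SAN's 1,200 MB/sec gives the 2,800 MB/sec quoted above.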
The PADB 2.0 analytics database has a list price of $100,000 per TB, but discounts are available for volume purchases. ParAccel, which has several dozen paying customers (including PriceChopper, OfficeMax, Merkle, and Autometrics), also sells the software under a subscription model for $5,000 per TB per month. ®