Original URL: http://www.theregister.co.uk/2012/12/21/amazon_ec2_fat_storage_data_pipeline/

Amazon fluffs up fat EC2 images for big data munching

Flips switch on Data Pipeline automagic bit shifter

By Timothy Prickett Morgan

Posted in Cloud, 21st December 2012 21:21 GMT

The big fat storage instances that Amazon Web Services was promising to deliver back at its re:Invent user conference in November are now shipping, and we now know a few more things about them – such as how expensive they are.

Amazon has also fired up the Data Pipeline service, which is a workflow-based tool for moving data between various AWS services and into and out of third-party databases and data stores.

The High Storage Eight Extra Large instance, abbreviated hs1.8xlarge by AWS, has 117GB of virtual memory and two dozen 2TB drives, for a total of 48TB of local capacity. It has 16 virtual cores assigned to it for a total of 35 EC2 Compute Units (ECUs) of processing power, which is a little less than half that of the generic 8XL EC2 instance on which it is based, which has 88 ECUs of virtual oomph. AWS said in a blog post that those local drives in the physical server can deliver 2.4GB/sec of I/O performance through the customized Xen hypervisor that underlies all EC2 instances on the AWS cloud.

Amazon recommends that customers using these High Storage instances turn on RAID 1 mirroring, or RAID 5 or 6 striping with parity protection, to secure their data, and says further that they should run a clustered file system such as Gluster (also known as Red Hat Storage Server if you use the commercial version) or a distributed storage system such as the Hadoop Distributed File System (HDFS) to provide fault tolerance. And, as you might expect, Amazon also wants customers to back up the data they put on these storage-heavy compute nodes onto its S3 object storage.
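To get a rough feel for what those RAID options do to the 48TB of raw capacity, here is a minimal Python sketch. It assumes, purely for illustration, that all 24 drives sit in a single array at each RAID level (with RAID 1 treated as mirrored pairs); the function name and layout are hypothetical and are not something Amazon has specified.

```python
# Rough usable-capacity arithmetic for the 24 x 2TB local drives.
# Assumption: one array spanning all drives per RAID level; real layouts may differ.
DRIVES = 24
DRIVE_TB = 2


def usable_tb(level: str, drives: int = DRIVES, drive_tb: int = DRIVE_TB) -> int:
    """Return usable capacity in TB for a single array at the given RAID level."""
    if level == "raid1":
        # Mirrored pairs: half of the raw capacity holds copies.
        return (drives // 2) * drive_tb
    if level == "raid5":
        # One drive's worth of capacity goes to parity.
        return (drives - 1) * drive_tb
    if level == "raid6":
        # Two drives' worth of capacity goes to parity.
        return (drives - 2) * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


for level in ("raid1", "raid5", "raid6"):
    print(level, usable_tb(level), "TB usable of", DRIVES * DRIVE_TB, "TB raw")
```

In practice you might well split the drives into several smaller arrays, which would change these figures.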

Amazon says that the High Storage instance is aimed at Hadoop data munching, data warehousing, log processing, and seismic-analysis workloads where having lots of local storage on the nodes and high sequential I/O are important.

At the moment, the High Storage instances are only available from Amazon's US East region in northern Virginia, and other regions around the globe will get these fat storage nodes in the coming months.

And they're not cheap, at $4.60 per hour for on-demand instances running Linux and $4.931 per hour running Windows. A regular 8XL instance (also known as a Cluster Compute instance) costs $2.40 per hour running Linux and $2.97 per hour running Windows. Those 8XL instances have a little more than twice as much compute, but hardly any local storage. That's US East region pricing on EC2; other regions will have slightly different pricing.
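For those who like to see the sums, the Linux on-demand prices and ECU ratings quoted above can be compared with a short Python sketch. The 730-hour month is an assumption (365 days times 24 hours, divided by 12), and the dictionary keys are just labels chosen for this example.

```python
# On-demand Linux prices and ECU ratings as quoted in the article (US East).
INSTANCES = {
    "hs1.8xlarge": {"usd_per_hour": 4.60, "ecus": 35},
    "8XL cluster compute": {"usd_per_hour": 2.40, "ecus": 88},
}
HOURS_PER_MONTH = 730  # assumption: average month, 365 * 24 / 12


def cost_per_ecu_hour(name: str) -> float:
    """Hourly price divided by the instance's ECU rating."""
    spec = INSTANCES[name]
    return spec["usd_per_hour"] / spec["ecus"]


def monthly_cost(name: str) -> float:
    """Cost of running one instance flat out for an average month."""
    return INSTANCES[name]["usd_per_hour"] * HOURS_PER_MONTH


for name in INSTANCES:
    print(
        f"{name}: ${cost_per_ecu_hour(name):.3f} per ECU-hour, "
        f"${monthly_cost(name):,.2f} per month"
    )
```

The per-ECU figure only captures compute; it says nothing about the 48TB of local disk that the extra money buys.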

The High Storage instances are being used for Amazon's own Redshift data warehousing service and are options for the Elastic MapReduce Hadoop service, as well.

On Friday, Amazon also turned on its Data Pipeline service so customers can start using it, as you can see in this blog post. The service provides a workflow to automatically move information out of Amazon's S3 object storage, Relational Database Service, DynamoDB NoSQL data store, and Elastic MapReduce Hadoopery, into them from applications, or across these various services as data is chewed and sorted for various applications.

Data Pipeline has a free usage tier, just like EC2 instances, and at the moment is only available in the US East region, just like the fat storage server slices. You can run five "low frequency" activities, which means they are scheduled to run no more than once a day, in this free tier. The "high frequency" tier is not free, and it is for data movements that occur more than once a day.

You have to use Amazon's graphical tool to build pipelines to move data between services, and you pay 60 cents per month for a low-frequency data movement and $1 per month for high-frequency data movements. You have to pay $1 per month for each inactive pipeline you have set up but not used, and if you want to do data movements either out to or in from outside data sources, then a low-frequency data movement will cost you $1.50 per month to set up and $2.50 per month if you do it more than once a day.
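The fee schedule above boils down to a small lookup table, sketched here in Python. The rates are the ones quoted in this article; the function name, the "aws"/"external" labels, and the example workload are hypothetical, and the free tier is deliberately left out of the sum.

```python
# Monthly Data Pipeline fees in US cents, as quoted in the article.
# Keys are (where the data lives, how often the activity runs).
RATES_CENTS = {
    ("aws", "low"): 60,         # between AWS services, at most once a day
    ("aws", "high"): 100,       # between AWS services, more than once a day
    ("external", "low"): 150,   # to or from outside sources, at most once a day
    ("external", "high"): 250,  # to or from outside sources, more than once a day
}
INACTIVE_PIPELINE_CENTS = 100   # each pipeline set up but not used


def monthly_bill_cents(activities: dict, inactive_pipelines: int = 0) -> int:
    """Sum the monthly fees for the given activity counts (free tier ignored)."""
    total = sum(RATES_CENTS[kind] * count for kind, count in activities.items())
    return total + INACTIVE_PIPELINE_CENTS * inactive_pipelines


# Hypothetical workload: three low-frequency and two high-frequency movements
# inside AWS, one high-frequency movement to an outside store, one idle pipeline.
example = {("aws", "low"): 3, ("aws", "high"): 2, ("external", "high"): 1}
bill = monthly_bill_cents(example, inactive_pipelines=1)
print(f"${bill / 100:.2f} per month")
```

Working in whole cents keeps the arithmetic exact and sidesteps floating-point rounding on the dollar amounts.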

These Data Pipeline service fees do not include any bandwidth or storage fees associated with core AWS infrastructure services. ®