Speed is the essence of WAR
Data location, location, location
WAR on the cloud, Part 4 In part 3 I tested some different semi-cloudy solutions for mirrors of my site and I am in the process of replacing one dedicated WebVisions Linux machine with two virtual private system (VPS)es in separate AsiaPac countries, for less money in total. Ker-ching!
Cloud files: public or private?
But this still isn't really getting the full cloud religion, and in particular I haven't dealt with storage and the cloud elegantly. I have simply been letting my existing code manage local mirror caching, with no special cloud storage support at all.
At the moment, when running on a bare system (or VPS) as a mirror, when a request is made for one of the multimedia exhibits in the catalogue, the mirror feeds as much of it (if any) as it has locally from cache. Then the mirror streams the rest over from the master server while saving to local cache.
Thus for popular exhibits, after the first load from a given mirror, every other user of that mirror gets it streamed from the mirror's local cache for speed, and the first download is only a little slower than going to the master direct. This works, but mirrors don't all have the same outgoing bandwidth available, and bandwidth is most important for larger downloads.
(On Amazon's Elastic Beanstalk, if for some reason it decides that my mirror is taking too much CPU time, my mirror gets restarted and its cache discarded, which is wasteful. In part I mitigated that by not having such lightweight cloud mirrors do any significant pre-caching other than of the most popular content. If I could persist the cache through a restart I could reconsider this fix.)
Amazon and Rackspace both support (private) cloud-based storage and a public content delivery network (CDN), either of which might perform better than my existing solution, especially in conjunction with my lower-bandwidth mirrors.
The private persistent cloud storage could be used "behind" my mirrors with the mirrors as a shared concurrency-safe cache with the mirror front-ends protecting against excessive use of bandwidth, and/or the public cloud/CDN could serve appropriate files directly to end users.
Ambush bills again
With neither Rackspace nor Amazon is it possible to restrict download bandwidth and bills. Even if not feeling paranoid about DoS/DDoS attacks, pilfered bandwidth through hotlinking can be a significant nuisance. I checked with Rackspace and it's not currently possible to restrict downloads (say) to requests with a Referer header that matches a supplied regex, which would stop most casual misuse.
The lack of such first-line defences and any cost cap would imply that regular active monitoring is needed, and possibly only selectively making available via the CDN material less likely to be hotlinked (such as site static furniture) and a sampling of requests for popular items not currently setting off hotlink alarms.
On Amazon's AWS the public CDN is partitioned by geographical area (with extra complication and cost to distribute content to each region) whereas the Rackspace CDN (Akamai underneath) seems to be global. The enticing prospect therefore is that one Rackspace CDN could do much of the heavy lifting on behalf of all of the mirrors leaving them just to construct and serve the relatively small Web pages that the end user sees.
Rackspace's CDN is built with OpenStack which reduces the chance of having to throw away or redo any CDN integration work if I switch to a different CDN provider or use more than one.
CDN vs ADSL
I tested a simple face-off between Rackspace's public CDN and my home/office Apache Web site in London serving a reasonable-size binary file (a few MB), both for latency and bandwidth.
I tried pulling down the file to the following locations (within co-lo facilities, not retail broadband connections): UK (London), US (Atlanta), SG, AU (Sydney), IN (Mumbai). (Connectivity from the IN machine was sufficiently erratic and poor at the times I was testing that I have excluded it from the results.)
In general, when downloading the file from my office Apache, I could max out my link outbound at at little over 128KB/s, and the round-trip time (ie, latency) varied from 24ms to the UK machine up to 360ms for AU.
As I can construct and serve a page from one of my mirrors in typically 50ms or (much) less, these latencies to serve up page furniture are significant and annoying to end users, especially when exceeding 100ms, ie: for all but the UK.
When downloading a large file, latency is less significant than bandwidth.
With the Rackspace CDN, with content nominally uploaded to the "UK" cloud, latency to download to UK and US machines was under 1ms and the worst was SG at under 70ms which would have beaten all but the UK-to-UK serving from my own office Apache.
The bandwidth available from the Rackspace CDN was good everywhere too, maxing out my SG connection (at about 240kBytes/s) and getting as high as 22MBytes/s downloading to a UK host.
Latency and bandwidth
Note that the SG and AU servers were probably maxing out their in-bound links during the CDN tests (2Mbps and 10Mbps burstable) rather than indicating the maximum CDN bandwidth available.
Conclusion: Fit or fad?
The upshot of this simple experiment is that it is worthwhile in both latency and bandwidth terms to improve the user experience for small and large files (with perceived performance dominated by latency and bandwidth respectively) to consider taking advantage of a commodity CDN, such as that of Rackspace.
Indeed, even with my own faster mirrors I'd have difficulty matching the better CDN numbers, so if hotlinking and so on were not a worry and I wanted to maximise performance, then I should probably only serve the dynamic page content from my own mirrors, with all other material served by the CDN.
So come on Rackspace, Amazon, et al, gimme a way to control bandwidth and risks and you'll have one more customer and widen your appeal to other SMEs too... ®
Sponsored: Hyper-scale data management