It's not the size of your pipe, it's the way you use it

Original URL: https://www.theregister.com/2011/11/09/war_on_the_cloud_part_6/

How to manhandle CDN bandwidth like a pro

Posted in SaaS, 9th November 2011 13:03 GMT

WAR on the cloud 6 In part 5, I tuned my site's home/entry page to load faster than Google's, in part because page-load time and general responsiveness are important to retaining users and those "sticky eyeballs".

Now I want to make better use of my already-paid-for resources to handle assets that I didn't feel comfortable putting on a third-party network due to lack of control (potentially big bills from misuse by third parties for example). More stick for free...

In the end I punted only relatively less valuable media off to Rackspace's CDN (content delivery network) because of the lack of barriers to even casual abuse such as hotlinking – ie, embedding my images in someone else's page so that I end up paying for bandwidth every time their page is viewed.

Rackspace has now shown some interest in providing controls (such as a regex check on the Referer header), which is good, but it's still not enough for me to put my more valuable and larger assets (primarily images) on their CDN.
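
To make the idea of a Referer regex check concrete, here is a minimal sketch in Python. It is not Rackspace's mechanism; the `example.org` domain, the pattern, and the `allow_request` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical pattern: accept pages on example.org and its subdomains only.
# The trailing (/|$) stops look-alike hosts such as example.org.evil.com.
ALLOWED_REFERER = re.compile(
    r"^https?://([a-z0-9-]+\.)*example\.org(:\d+)?(/|$)", re.IGNORECASE
)


def allow_request(referer, allow_missing=True):
    """Return True if an asset request should be served.

    Many browsers and proxies omit the Referer header, so by default a
    missing header is allowed; that is a policy choice, not a requirement.
    """
    if not referer:
        return allow_missing
    return ALLOWED_REFERER.match(referer) is not None
```

A check like this only deters casual hotlinking, since the Referer header is trivially forged by anyone who cares to.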

For interactivity it is usually most important to minimise latency (of which network round-trip-time is a big component), and thus serve the user's requests from a mirror as geographically close to the user as possible.

For larger objects (ie, more bytes, such as big images or videos) high bandwidth is more important to minimise the time to complete a request. (There are other issues such as packet loss and network congestion on long routes to be considered too, though.)

DIY CDN

I already bounce users to mirrors closer to them to minimise latency, so now I want to selectively serve (large) content from remote mirrors to maximise bandwidth.

Other than some new and exciting bugs I wrote into the code (and still haven't fully squashed), the split between the two modes was relatively simple. The results started to show fairly quickly after I rolled the code out on Monday to several of the mirrors. The bandwidth curve for the main UK mirror is starting to flatten, following local time-of-day use less closely; in this case I suspect that the UK mirror is mainly taking some load from the US mirror.

Looking at the underlying numbers I can also see that the two largest/fastest mirrors (UK and US) are closer to each other in spare capacity more of the time than before.

See the bandwidth profile change a little at the start of the week when the new mechanism went (partially) live: it's flatter as more non-local traffic is carried.

The major beneficiaries are those viewers bounced to a mirror close to them that is relatively small (with low-ish bandwidth): since a lot of the material that they see on a page is now served from the Rackspace CDN or my internal 'high-bandwidth' CDN, those mirrors feel a whole lot faster. This makes them seem as snappy as I had intended all along. And it costs nothing extra beyond a little coding time.

To avoid confusing search-engine spiders (and end users), and to maximise robustness on a human's first page hit, I turn off most of the high-bandwidth re-routing magic in these two cases.
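
The split between the two modes might be sketched roughly as follows. This is an illustration under assumptions, not the site's actual code: the `Mirror` class, the `pick_asset_host` function, and the 50,000-byte cut-off for a "large" asset are all hypothetical.

```python
from dataclasses import dataclass

LARGE_ASSET_BYTES = 50_000  # hypothetical threshold for "large" content


@dataclass
class Mirror:
    host: str
    spare_bandwidth: float  # unused capacity as reported by the mirror


def pick_asset_host(local, mirrors, asset_bytes, is_spider=False, first_hit=False):
    """Choose which mirror's hostname to use for an asset URL in a page.

    Small assets stay on the local (low-latency) mirror. Large assets go
    to whichever mirror reports the most spare bandwidth. Spiders and a
    human's first page hit always get local URLs, for robustness.
    """
    if is_spider or first_hit or asset_bytes < LARGE_ASSET_BYTES:
        return local.host
    return max(mirrors, key=lambda m: m.spare_bandwidth).host
```

For example, with a small local mirror and a big remote one, a 2MB image would be linked from the remote host while a 2KB icon stays local.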

Tinker, Tailor

There is further optimisation (aka tinkering) possible of course: for example, for a mirror to continue to route high-bandwidth requests to itself if near the top of the spare-bandwidth league (eg, within say 25 per cent), not only if at the very top.

Another tweak, to marginally improve responsiveness and avoid hiccups in underlying TCP connections when a very long or congested geographical route would otherwise be taken, is to pick a physically close mirror near the top of the bandwidth rankings, rather than just the one at the very top. Improved proximity may somewhat compensate for the nominally lower bandwidth and help to complete a user's page display quickly.

Another tweak would be to spread traffic between all mirrors near the top to induce less sloshing back and forth between candidates.
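
The three tweaks above could look something like this minimal sketch. None of it is the live code: the `Mirror` fields (including the `distance_km` estimate), the function names, and the way "near the top" is computed as a fraction of the best figure are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Mirror:
    host: str
    spare_bandwidth: float  # unused capacity as reported by the mirror
    distance_km: float      # hypothetical estimate of distance to the user


def near_top(mirrors, margin=0.25):
    """Mirrors whose spare bandwidth is within `margin` of the best one."""
    top = max(m.spare_bandwidth for m in mirrors)
    return [m for m in mirrors if m.spare_bandwidth >= top * (1 - margin)]


def pick_sticky(local, mirrors, margin=0.25):
    """Tweak 1: keep traffic on the local mirror if it is near the top."""
    candidates = near_top(mirrors, margin)
    if local in candidates:
        return local
    return max(candidates, key=lambda m: m.spare_bandwidth)


def pick_closest(mirrors, margin=0.25):
    """Tweak 2: among near-top mirrors, prefer the physically closest."""
    return min(near_top(mirrors, margin), key=lambda m: m.distance_km)


def pick_spread(mirrors, margin=0.25, rng=random):
    """Tweak 3: spread load randomly across all near-top mirrors."""
    return rng.choice(near_top(mirrors, margin))
```

Picking randomly among the near-top candidates is only one way to damp the sloshing; weighting the choice by spare bandwidth would be another.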

These changes should result in more robust and often smaller pages and more intuitive dependencies.

Lower carbon packet-print?

Another wrinkle – which is more a problem for the likes of Google and Facebook, which have data centres for which power rates are time-dependent – is to move processing to where electricity is cheaper and/or has lower carbon intensity (less CO₂ emitted per kWh consumed and thus per user action performed). I already have my servers report a slightly lower available bandwidth (to draw in slightly less traffic) when running in energy-conserving mode for reasons like those, so while it doesn't save me any pennies it may well be trimming my footprint a little, assuming that the extra distance that packets may travel doesn't outweigh that.
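
As a rough illustration of under-reporting bandwidth in energy-conserving mode, a server could scale down the figure it advertises. The 10 per cent reduction and the function name below are assumptions; the actual amount used on the site is not stated.

```python
ECO_DERATE = 0.9  # hypothetical: advertise 10 per cent less when conserving


def reported_spare_bandwidth(measured_spare, energy_conserving, derate=ECO_DERATE):
    """Spare bandwidth a mirror advertises to its peers.

    In energy-conserving mode the figure is scaled down slightly so the
    mirror ranks a little lower and draws in a little less traffic.
    """
    if energy_conserving:
        return measured_spare * derate
    return measured_spare
```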

Audience participation

Thanks for interesting comments on previous parts to this series.

One point that I didn't make clear is that security is not an enormous issue for the site described in this series beyond avoiding vandalism, XSS attacks, and generally being used as a weapon to attack my users and other sites with. The site doesn't use SSL and doesn't hold any personal details, and the only cookies used are benign session cookies, mainly noting a user's locale override, if anything.

I have put together sites, such as for retail finance, which would require serious effort on security issues, including the possibilities of an attacker using IPv6 and IPv4 together in unusual ways to find chinks in the armour.

Tagging for cache

In the comments to part 5, "theodore" asked what I meant by "redoing all my Last-Modified and ETag headers and my response to a browser's If-Modified-Since and If-None-Match", and whether I meant making sure that all of my mirrored copies of files are identical in both content and timestamp.

I do indeed make sure that timestamps and content are identical between sites (except for brief transients when updates are being propagated to them all), though identical is a bit subtle here for pages with ads in them for example. Images will in the main be bit-for-bit identical (strongly identical in ETag terms) whichever mirror you visit, but the HTML pages will only be weakly identical because although the main information content may be the same, ads and links to other internal pages may be somewhat different for a number of reasons.
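
The strong/weak distinction can be sketched as below. This is an assumed scheme for illustration, not how the site builds its tags: hashing with SHA-1, truncating to 16 hex digits, and the function names are all my own choices. The `W/` prefix and the two comparison rules are standard HTTP.

```python
import hashlib


def strong_etag(body: bytes) -> str:
    """Strong tag: derived from the exact bytes served (eg, an image)."""
    return '"' + hashlib.sha1(body).hexdigest()[:16] + '"'


def weak_etag(main_content: str) -> str:
    """Weak tag: derived only from the main information content of a
    page, ignoring ads and internal links that vary between mirrors."""
    digest = hashlib.sha1(main_content.encode("utf-8")).hexdigest()[:16]
    return 'W/"' + digest + '"'


def _opaque(tag: str) -> str:
    """Strip the weak marker, leaving the quoted opaque value."""
    return tag[2:] if tag.startswith("W/") else tag


def weak_match(a: str, b: str) -> bool:
    """Weak comparison: opaque values equal, weak markers ignored."""
    return _opaque(a) == _opaque(b)


def strong_match(a: str, b: str) -> bool:
    """Strong comparison: neither tag is weak and they are identical."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b
```

Weak comparison is what a server would normally use for `If-None-Match` on a GET, so two mirrors emitting the same weak tag for a page let a browser reuse its cached copy even though the bytes differ slightly.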

But the real issues that caught me out this time around were:

Being sure not to use the If-Modified-Since header at all once any ETag matching had been done.
Actually making sure that all significantly-trafficked pages had a sensible ETag generated for them, which is harder than it appears at first blush with a heavily templated site and because of weakness issues mentioned above.
Finding that some objects could not realistically use ETag/Last-Modified at all if I was to avoid issues with browsers requesting duplicate copies when hopping between mirrors behind one 'main' DNS alias.
Lastly, finding that I have to repeat these headers and Cache-Control even on a 304 Not Modified response (as an upcoming IETF draft suggests should become mandatory) to avoid buggy browsers not updating their internal notion of when to re-request (ie, they should add on the original max-age but forget to, and so wrongly re-request every time after the initial expiry).
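
Two of the points above (ignoring If-Modified-Since once ETag matching has been done, and repeating the validators and Cache-Control on a 304) can be sketched as follows. This is a simplified illustration, not the site's code: the `conditional_response` function, its plain-dict headers with exact-case names, and the returned `(status, headers)` pair are assumptions, and `last_modified` is assumed to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def _opaque(tag):
    """Strip the weak marker so tags can be compared weakly."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_response(request_headers, etag, last_modified, max_age):
    """Return (status, headers) for a GET that may be conditional."""
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1s resolution
    # The same validators and lifetime go on 200 and 304 responses alike.
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
        "Cache-Control": f"max-age={max_age}",
    }
    not_modified = False
    inm = request_headers.get("If-None-Match")
    if inm is not None:
        # ETag matching was attempted: If-Modified-Since is not consulted.
        tags = [t.strip() for t in inm.split(",")]
        not_modified = "*" in tags or any(_opaque(t) == _opaque(etag) for t in tags)
    else:
        ims = request_headers.get("If-Modified-Since")
        if ims:
            try:
                not_modified = last_modified <= parsedate_to_datetime(ims)
            except (TypeError, ValueError):
                pass  # unparseable or zone-less date: treat as unconditional
    return (304 if not_modified else 200), headers
```

Note that a request carrying a stale ETag but a fresh If-Modified-Since date gets a full 200 here, because the date is never examined once an If-None-Match header is present.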

After having done that, I think I have gone a little bit too far, and have to row back on cache lifetime a touch, but generally I feel more in control of the process and have less random logic scattered around. ®