Feeds

Google's HTTP Archive merges with Internet Archive

One records pages. The other records speed

Velocity The HTTP Archive – a fledgling effort to record the performance of sites across the interwebs – has merged with the Internet Archive, whose Wayback Machine has long kept a similar record of internet content.

Google's Steve Souders – who founded the HTTP Archive and will continue to run it – announced the merger this morning at the O'Reilly Velocity conference in Santa Clara, California. The ultimate goal of the project is to improve the overall performance of the web by exposing its bottlenecks.

"I've had the idea of doing this for the past four or five years, where I saw that a large number of websites – even the most popular ones – weren't tracking very critical statistics about performance, like size of JavaScript or the number of script requests," Souders said.

"I thought [the project] had a lot of synergy with what the Internet Archive was doing. They were kind of two sides of the same coin. The Internet Archive – the Wayback Machine – is tracking the content of the web, whereas the HTTP Archive is tracking how that content is built and served."

Essentially, the HTTP Archive is now a sub-project of the Internet Archive, a not-for-profit based in San Francisco.

Souders also announced that in merging with the Internet Archive, the project has attracted several big name sponsors, including Google and Mozilla as well as New Relic and Strangeloop. New Relic offers an online service for measuring site performance, while Strangeloop provides a service for accelerating website load times.

Souders founded the HTTP Archive this past fall. Using the Webpagetest.org tool created by Google's Patrick Meenan, the project originally crawled about a thousand URLs, and a month later, it expanded to roughly 18,000. With those sponsors behind the project, the new goal is to track the performance of the web's top one million sites.

Steve Souders

Basically, the project runs sites through the Webpagetest batch API, and the results are shuttled into a MySQL database available to world+dog. The tests track not only how fast pages are, but how they serve their content and how much data is downloaded.
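
For a sense of how such a pipeline fits together, here is a minimal sketch in Python. It assumes WebPageTest's public runtest.php and jsonResult.php endpoints and an API key; the JSON field names and the one-table layout are illustrative rather than the HTTP Archive's real schema, and SQLite stands in for the project's MySQL database only to keep the example self-contained.

# Minimal sketch of the crawl-and-record loop, assuming WebPageTest's public
# runtest.php / jsonResult.php endpoints and an API key. Field names and the
# table layout are illustrative, not the HTTP Archive's actual schema; the
# real project drives the batch API at scale and stores results in MySQL.
import json
import sqlite3
import time
import urllib.parse
import urllib.request

WPT_HOST = "https://www.webpagetest.org"
API_KEY = "YOUR_API_KEY"  # hypothetical placeholder


def submit_test(url):
    """Queue a WebPageTest run and return its test ID."""
    params = urllib.parse.urlencode({"url": url, "f": "json", "k": API_KEY})
    with urllib.request.urlopen(f"{WPT_HOST}/runtest.php?{params}") as resp:
        return json.load(resp)["data"]["testId"]


def fetch_result(test_id, poll_seconds=30):
    """Poll until the test completes (statusCode 200), then return its data."""
    while True:
        with urllib.request.urlopen(f"{WPT_HOST}/jsonResult.php?test={test_id}") as resp:
            body = json.load(resp)
        if body.get("statusCode") == 200:
            return body["data"]
        time.sleep(poll_seconds)  # 1xx status codes mean queued or still running


def store(conn, url, data):
    """Record a few headline numbers per page, roughly what the archive tracks."""
    fv = data["median"]["firstView"]  # result field names are assumptions
    conn.execute(
        "INSERT INTO pages (url, load_time_ms, bytes_in, num_requests) VALUES (?, ?, ?, ?)",
        (url, fv.get("loadTime"), fv.get("bytesIn"), fv.get("requestsFull")),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect("archive.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT, load_time_ms INTEGER, bytes_in INTEGER, num_requests INTEGER)"
    )
    for url in ["https://example.com"]:
        store(conn, url, fetch_result(submit_test(url)))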

During a lightning demonstration at today's conference – where Souders is co-chair – he compared the performance and makeup of the top 100 websites (by traffic) with the top 1,000. With the top 100, for instance, the average page size is about 437KB, while the top 1,000 sites average 690KB. In the top 100, 26 per cent of resource requests fail to use caching headers, compared to 40 per cent in the top 1,000. And, predictably enough, the top 100 also use significantly less Flash (36 per cent versus 50 per cent).
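
As a rough illustration of how that kind of comparison falls out of the archived data, the snippet below averages page size across traffic-rank buckets. The pages table and its rank and bytes_total columns are placeholders rather than the dump's actual schema, and SQLite again stands in for the project's MySQL database.

# Illustrative aggregate over an HTTP Archive-style crawl table. The schema
# (a pages table with a traffic rank and total page bytes) is a placeholder,
# not the project's real MySQL dump layout.
import sqlite3

conn = sqlite3.connect("httparchive_local.db")  # a locally imported copy of the data

for label, cutoff in [("top 100", 100), ("top 1,000", 1000)]:
    row = conn.execute(
        "SELECT AVG(bytes_total) / 1024.0 FROM pages WHERE rank <= ?",
        (cutoff,),
    ).fetchone()
    avg_kb = row[0] or 0.0  # AVG() is NULL when no rows match
    print(f"{label}: about {avg_kb:.0f}KB average page size")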

Meenan, who also spoke at today's conference, created Webpagetest while at AOL, but this past fall he was hired away by Google, which put him to work full-time on Webpagetest and beefed up the project with additional engineering resources. Both Webpagetest and the HTTP Archive are open source projects: Webpagetest is under a BSD license, while the HTTP Archive is under an Apache license.

The HTTP Archive has not yet accepted patches, but it has about six contributors at this point, including Souders and Meenan. You can readily browse the data at HTTPArchive.org or download it as a MySQL dump.

One of Google's core missions is to improve the speed of the web, from one end to the other. Souders has long been at the forefront of the company's efforts to improve site load times. Previously, he was chief of performance at Yahoo!, where he built the company's YSlow performance tool. ®
