Facebook bars crawls from all but select few

Use the API...bitch


Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others.
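
For those unfamiliar with the mechanics, a whitelist-style robots.txt simply names the crawlers that are allowed in and turns everyone else away. The snippet below is an illustrative sketch rather than Facebook's actual file, and uses Python's standard urllib.robotparser to show how a compliant crawler would read it:

import urllib.robotparser

# Illustrative whitelist-style robots.txt -- not Facebook's real file.
# Named crawlers are allowed everywhere; every other user agent is barred.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A whitelisted crawler may fetch pages; an unlisted one may not.
print(parser.can_fetch("Googlebot", "http://www.facebook.com/some-public-page"))         # True
print(parser.can_fetch("SomeNewSearchBot", "http://www.facebook.com/some-public-page"))  # False

A well-behaved crawler checks this file before fetching anything, so a new search engine's bot that follows the standard now stops at the front door.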

Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one developer for crawling, before adding new terms of service that barred scraping without the company's written permission. Some — including programmer and blogger Pete Warden, the man whom Facebook threatened to sue — had complained that the social networking site was breaking the rules of the interwebs. The site was allowing unfettered crawling, but the company's legal team was not.

"You've chosen to leave all that information out in the open so you can benefit from the search traffic, and instead try to change the established rules of the web so you can selectively sue anyone you decide is a threat," Warden told the company before it changed its robot.txt.

"The sad fact is, your leadership has decided to change the open rules that have allowed the web to be such an interesting and innovative place for the past decade."

Following Facebook's robots.txt change, Warden is pleased that the situation has been clarified. "I'm very happy that Facebook have done the right thing and abandoned their attempt to change the rules the web has operated under for the last 15 years," he says. "If you could still be sued despite following robots.txt, then only large corporations with lots of money to pay lawyers could afford to build new search engines and we'd still be using AltaVista instead of Google."

Uber Googler Matt Cutts is pleased as well. "A good move by Facebook to bring their robots.txt and related policies into line with internet standards," he said in a Tweet.

So, Facebook is now following the rules. But it's still creating a barrier to entry for new search engines and other crawlers. If you're not a major search engine, you still have to apply for written permission to crawl the site. And that benefits, well, Matt Cutts and Google.

"You're definitely right on that," Warden tells us. "Have the companies mentioned in [Facebook's] robots.txt actually signed the agreement they ask little guys to sign? Or are sites that drive a lot of traffic (including Yandex in Russia!) being given a sweetheart deal? I'll be very impressed if they've persuaded Google to sign up to [its] conditions."

Facebook threatened to sue Warden in April after he built tools that crawled and analyzed Facebook data for a service called fanpageanalytics.com.

Facebook CTO Bret Taylor indicates the company will grant crawling permission to any "legitimate" search outfit. "We will whitelist crawlers when legitimate companies contact us who want to crawl us (presumably search engines)," reads a blog post from Taylor.

Taylor says that the company should have updated the robots.txt sooner. "I think it was bad for us to stray from Internet standards and conventions by having a robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment." And he says that the company was merely trying to crack down on miscreants. It wants non-search services using the company's data API rather than crawling the site.

"Basically, [Facebook] users have complete control over their data, and as long as [the] user gives an application explicit consent, Facebook doesn't get in the way of the user using their data in your applications beyond basic protections like selling data to ad networks and other sleazy data collectors," he says.

"Crawling is a bit of special case. We have a privacy control enabling users to decide whether they want their profile page to show up in search engines. Many of the other 'crawlers' don't really meet user expectations...Some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy."

Facebook did not immediately respond to a request for comment. ®
