Facebook bars crawls from all but select few
Use the API...bitch
Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others.
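For readers unfamiliar with the mechanism, here is a minimal sketch of how an allowlist-style robots.txt behaves from a crawler's point of view. The file contents below are hypothetical and are not Facebook's actual robots.txt; the paths, the example.com URLs, and the ExampleStartupBot name are all assumptions made for illustration. The check uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist-style robots.txt (NOT Facebook's real file):
# named crawlers get their own rules, everyone else is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /ajax/

User-agent: bingbot
Disallow: /ajax/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# ExampleStartupBot is a made-up crawler name that is not on the allowlist.
for agent in ("Googlebot", "bingbot", "ExampleStartupBot"):
    allowed = may_crawl(parser, agent, "https://www.example.com/some.profile")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules the two named crawlers may fetch the profile page, while any unlisted crawler falls through to the catch-all entry and is blocked from the whole site. Note that robots.txt is a convention that well-behaved crawlers honor voluntarily; it does not technically prevent access.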
Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one developer for crawling, before adding new terms of service that barred scraping without the company's written permission. Some, including programmer and blogger Pete Warden (the man Facebook threatened to sue), had complained that the social networking site was breaking the rules of the interwebs. The site was allowing unfettered crawling, but the company's legal team was not.
"You've chosen to leave all that information out in the open so you can benefit from the search traffic, and instead try to change the established rules of the web so you can selectively sue anyone you decide is a threat," Warden told the company before it changed its robots.txt.
"The sad fact is, your leadership has decided to change the open rules that have allowed the web to be such an interesting and innovative place for the past decade."
Following Facebook's robots.txt change, Warden is pleased that the situation has been clarified. "I'm very happy that Facebook have done the right thing and abandoned their attempt to change the rules the web has operated under for the last 15 years," he says. "If you could still be sued despite following robots.txt, then the only large corporations with lots of money to pay lawyers could afford to build new search engines and we'd still be using Altavista instead of Google."
Uber Googler Matt Cutts is pleased as well. "A good move by Facebook to bring their robots.txt and related policies into line with internet standards," he said in a tweet.
So, Facebook is now following the rules. But it's still creating a barrier to entry for new search engines and other crawlers. If you're not a major search engine, you still have to apply for written permission to crawl the site. And that benefits, well, Matt Cutts and Google.
"You're definitely right on that," Warden tells us. "Have the companies mentioned in [Facebook's] robots.txt actually signed the agreement they ask little guys to sign? Or are sites that drive a lot of traffic (including Yandex in Russia!) being given a sweetheart deal? I'll be very impressed if they've persuaded Google to sign up to [its] conditions."
Facebook threatened to sue Warden in April after he built tools that crawled and analyzed Facebook data for a service called fanpageanalytics.com.
Facebook CTO Bret Taylor indicates the company will grant crawling permission to any "legitimate" search outfit. "We will whitelist crawlers when legitimate companies contact us who want to crawl us (presumably search engines)," reads a blog post from Taylor.
Taylor says that the company should have updated the robots.txt sooner. "I think it was bad for us to stray from Internet standards and conventions by having an robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment." And he says that the company was merely trying to crack down on miscreants. It wants non-search services using the company's data API rather than crawling the site.
"Basically, [Facebook] users have complete control over their data, and as long as [the] user gives an application explicit consent, Facebook doesn't get in the way of the user using their data in your applications beyond basic protections like selling data to ad networks and other sleazy data collectors," he says.
"Crawling is a bit of special case. We have a privacy control enabling users to decide whether they want their profile page to show up in search engines. Many of the other 'crawlers' don't really meet user expectations...Some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy."
Facebook did not immediately respond to a request for comment. ®