Facebook bars crawls from all but select few

Use the API...bitch

Facebook has updated its robots.txt file so that the site can only be crawled by a short list of search engines, including Google, Microsoft's Bing, China's Baidu, Russia's Yandex, and a few others.
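For readers who want to see how an allowlist-style robots.txt plays out in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The file contents, the `example.com` URLs, the `ExampleBot` crawler name, and the `can_crawl` helper are all assumptions made up for illustration; this is not Facebook's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist-style robots.txt (NOT Facebook's real file):
# two named crawlers get access to most paths, everyone else is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = SAMPLE_ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot", "ExampleBot"):
        allowed = can_crawl(agent, "https://example.com/profile")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

In this sketch, a crawler that is not named in the file falls through to the `User-agent: *` entry and is blocked from every path, which is the barrier to entry the rest of this story is about. Note that robots.txt is a convention that well-behaved crawlers choose to honor; the file does not itself enforce anything.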

Previously, Facebook's robots.txt allowed anyone to crawl the site, although the company had threatened to sue at least one developer for crawling, before adding new terms of service that barred scraping without the company's written permission. Some, including programmer and blogger Pete Warden, the man whom Facebook threatened to sue, had complained that the social networking site was breaking the rules of the interwebs. The site was allowing unfettered crawling, but the company's legal team was not.

"You've chosen to leave all that information out in the open so you can benefit from the search traffic, and instead try to change the established rules of the web so you can selectively sue anyone you decide is a threat," Warden told the company before it changed its robots.txt.

"The sad fact is, your leadership has decided to change the open rules that have allowed the web to be such an interesting and innovative place for the past decade."

Following Facebook's robots.txt change, Warden is pleased that the situation has been clarified. "I'm very happy that Facebook have done the right thing and abandoned their attempt to change the rules the web has operated under for the last 15 years," he says. "If you could still be sued despite following robots.txt, then the only large corporations with lots of money to pay lawyers could afford to build new search engines and we'd still be using Altavista instead of Google."

Uber Googler Matt Cutts is pleased as well. "A good move by Facebook to bring their robots.txt and related policies into line with internet standards," he said in a Tweet.

So, Facebook is now following the rules. But it's still creating a barrier to entry for new search engines and other crawlers. If you're not a major search engine, you still have to apply for written permission to crawl the site. And that benefits, well, Matt Cutts and Google.

"You're definitely right on that," Warden tells us. "Have the companies mentioned in [Facebook's] robots.txt actually signed the agreement they ask little guys to sign? Or are sites that drive a lot of traffic (including Yandex in Russia!) being given a sweetheart deal? I'll be very impressed if they've persuaded Google to sign up to [its] conditions."

Facebook threatened to sue Warden in April after he built tools that crawled and analyzed Facebook data for a service called fanpageanalytics.com.

Facebook CTO Bret Taylor indicates the company will grant crawling permission to any "legitimate" search outfit. "We will whitelist crawlers when legitimate companies contact us who want to crawl us (presumably search engines)," reads a blog post from Taylor.

Taylor says that the company should have updated the robots.txt sooner. "I think it was bad for us to stray from Internet standards and conventions by having an robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment." And he says that the company was merely trying to crack down on miscreants. It wants non-search services using the company's data API rather than crawling the site.

"Basically, [Facebook] users have complete control over their data, and as long as [the] user gives an application explicit consent, Facebook doesn't get in the way of the user using their data in your applications beyond basic protections like selling data to ad networks and other sleazy data collectors," he says.

"Crawling is a bit of special case. We have a privacy control enabling users to decide whether they want their profile page to show up in search engines. Many of the other 'crawlers' don't really meet user expectations...Some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy."

Facebook did not immediately respond to a request for comment. ®
