Tim Berners-Lee goes postal on spam
The Greatest Living Briton explodes
So there we were. In a room devoted to Engineering, the man voted the Greatest Living Briton had exploded in front of me.
Sir Tim Berners-Lee, co-inventor of the World Wide Web, was at Southampton University to deliver an inaugural lecture for the School of Electronics and Computer Science, and to promote his latest initiative.
There's a whole new science out there waiting to be explored, called "Web Science", and he was here to explain how. The Web Science Initiative was "an umbrella, with lots of projects" around the world, he said.
Flanked by the great and the good, Professors Nigel Shadbolt (vice president of the British Computer Society), Wendy Hall and James Hendler, Sir Tim said he hoped this would set the agenda for years to come.
This new science comes with some grand claims attached. Shadbolt said he hoped the web would attract a new kind of undergraduate to computer science departments, who presumably had been bored by all that old-fashioned science and engineering.
Shadbolt implied we could learn a lot about humanity from looking at the Interweb.
"What actually happens on the web when people participate is all psychology: it's more accurate to think of the web as humanity connected," he said.
"It's people: things get published by people, the blogs are made by people, the links between them that Google follows are made by people."
Hendler jumped in:
"I was the external reader for a paper for somebody here at ECS. The first line of her thesis is 'This Document Is About People'. And I put down exclamation marks and 'hurrays!' That's a student who's beginning to understand web science!"
(So that's how you get on in Computing, dear readers).
But this was all a bit much for your reporter.
"The assumption behind everything you've said is that this research will create some kind of knowledge," I asked. "The other assumption is that the links you'll be examining to provide this knowledge are generated by humans.
"Now when I search for a term on Google Blogsearch or Technorati, two thirds of those results are robots. People at Google tell me between twenty per cent and a third of the index is junk - Google doesn't know which third."
"So, er... what's your research going to be worth?"
"That's one of the good questions," said Professor Hall.
Sir Tim countered, and it's worth quoting in full.
Do you remember when the web came out, the first search engines before Google were famously terrible. They were famous for producing lists of junky answers. Originally when the web was small there was no problem finding things because it was a list of websites, and each one had a picture of what was on it. When it got bigger you got the problem of where to find something and that problem got more and more acute, and then...
Somebody, in fact Google, is a nice web science example. They thought we can use this vector machine technology and it could solve the "where to find stuff" problem. So now the result is much more effective search engines. So yes, OK.
We silently wondered where this would lead. We didn't have to wait too much longer: the answer had almost arrived.
So one cycle further on... the spammers have gone to a considerable amount of trouble to build the "fake web". I don't know what the proportion is but as you say, there's domain names all linked in to each other all generated by computer, all full, OK, junk. Full of crap. So. There'll be another cycle. OK.
Er, we thought. Is that it?
So when you see what's happening, OK, there are a lot of spammers out there. It's like spam for email, you know. Suddenly spammers deluged us but email was designed for a world where everybody was friendly - and spam happens where people are motivated by pure greed and not part of a friendly club. You can design email systems around that; the email system is being converted so it works in this environment. I'm sure web search engines so they work in this environment...
But, er, how? Sir Tim's historical narrative had a few flaws in it. It seemed negligent not to point this out to the GLB.
"I remember AltaVista worked wonderfully - until it was gamed. Google was wonderful until it was gamed - entropy keeps returning. We don't seem to be making much progress. Are you saying we will somehow fix it - magic will happen?" asked your reporter.
There were howls of distaste from the panel.
Wendy Hall's face was thunderous - and I realized I had not merely made an unwelcome expression at a revivalist meeting - I'd farted in the church.
Hendler leapt in to take the discussion off topic:
We're not making much progress until you do the maths. We're now doing as well with two or three more zeros at the end of how many pages are out there. So in a sense you have to run as fast as you can to stand in the same place. What a lot of the researchers at Google look at is exactly how to keep scaling across.
That's a good answer - to a different question. It wasn't scale, but system integrity that we were talking about.
The GLB stuck to his account:
But AltaVista never gave good results. It never did this Eigenvector calculation. It never found the representative page for a given community. It would find pages which contained the page a lot of times; it wouldn't do any link analysis. The older search engines before Eigenvector machine systems came out were just never as good. And even, still, yes, they're being spammed, [inaudible] but it hasn't made the search engines less usable; they're still much more usable than the previous generation.
But Google has a real problem here, and it employs some of the cleverest people in the world.
"You'll have to figure out something Google can't figure out if your research is going to be worth anything," El Reg asked.
"How will you do that, exactly?"
Sir Tim replied:
It's about much more than people defending against abusing the system. There are a huge number of opportunities that will make the industry work more efficiently.
But he was getting snidey:
In your life, do you feel that spoofing of search engines is the main thing, the one that bugs you?
So, er... the cornerstone of the New Economy wasn't a concern. (I'll explain why this answer was somewhat short of being satisfactory, but let's continue with what happened next.)
Scenting blood, another reporter pursued the enquiry:
"So how are you going to stop the Semantic Web being poisoned?"
TBL, the GLB, replied:
Well, everybody who's building the semantic web pretty much that I know are building systems that take data from lots of places, but take data with an awareness of where those places are. So for example, suppose you're getting Geotags and the OS runs a service, lots of people in this country might trust the OS to say this point has a church with a spire - other people might say it's a great church to go to, other people might say it's a heathen church to go to... those are the other sources of data...
There was no let up from the press:
"But that was the basis for Google, and Google got poisoned... "
Shadbolt and Hendler stepped in to shield Sir Tim, but he was seething at the impertinence:
I remember a conference, we were discussing the Semantic Web, and someone asked what do you think is the worst thing that can happen and all the pencils come out. I know you two have been asking about "Woargh - I know the one about... what about the bad guys? Won't we be phished" There's a temptation to give readers about all the terrible things out there OK, and all the ways the web can become less usable.
At this point, your reporter wanted to remind Sir Tim that of all the problems the web has, a hostile press is not one of them. In fact, you can't pick up a newspaper or magazine without reading about how it's ushering in a New Age of Enlightenment. Time magazine gave "Person Of The Year" to every web user in America - or at least every one who looked at the mirror Time placed on its front cover.
He continued, cryptically:
Yes you'll find a bank that's less usable - ... I've never been phished.
So the Greatest Living Briton has never been phished, which is a relief. His answer to the Semantic Web didn't inspire much confidence for the rest of us: it would be used within the firewall, amongst trusted groups, "areas where one is much less worrying about the bad guys".
Here's what I found both disquieting and depressing from the GLB.
I've asked similar questions of engineers in every field. Without exception, all have thought deeply about the consequences of their original design decisions, and have expressed quite specific solutions. These are often quite radical rewrites - throwing out many of the assumptions they first made.
But not with the Web. It's a place to marvel and, like Candide, hope for the best.
If you're not already evangelical, you probably don't have a part to play in the "new science" of the Web. ®