Feeds

Sphinx - text search The Pirate Bay way

Like MySQL. But it can scale

Providing a secure and efficient Helpdesk

In 2001, Andrew Aksyonoff had an itch. He ran a website that indexed song lyrics and allowed users to search them. At the time, none of the open source options for text search worked. Lucene, the standard response at an engineering team meeting when the problem of search comes up, was still a baby. Indexing took too long, and it was a bear to install on anything other than a Lucene developer's machine.

Mnogosearch - the search engine you probably still haven't heard of - was too bloated and took a full twenty-four hours to index Aksyonoff's meager 150 megabytes of data. On top of all this, the standard algorithms for text search were about as useful as a venereal disease when it came to searching song lyrics. Andrew had a different idea of how text search should work - and the spare time to try it out. He did, and today, his software powers the search engines behind Craigslist, The Pirate Bay, and many more.

It's called Sphinx, and it's on track to become the open source world's canonical answer to the question of text search. MySQL and Solr, the two popular solutions, are showing their age. MySQL introduced full-text search in late 2000 as a way to more intelligently search blobs of text stored in databases. You can work a full-text clause into a query, and MySQL will rank the result rows by how relevant it thinks they are to the query.

MySQL uses textbook search algorithms and doesn't allow for a lot of relevance tuning. It's like a drawing from a five year old: The heart is in the right place, but everybody knows that kids suck at drawing. Implementation details aside, MySQL still suffers from scalability problems. Having ignored the trend of chip manufacturers to build multiple cores into CPUs, hoping that this unpleasant trend that required them to actually think about multi-threading would just blow over sooner or later, MySQL's ability to handle parallelism is, well, see the five year old's drawing.

Internet-famous MySQL wonk Jeremy Zawodny, who had the foresight to jump from the ship's bow as Yahoo started to take on water, replaced MySQL fulltext search at Craigslist with Sphinx. Craigslist used 25 machines to handle roughly 50 million queries per day on MySQL. Under that kind of load, Zawodny found that MySQL wasn't using much CPU or doing much disk I/O, which means it's spending all of its time waiting on thread locks. Oops.

Maybe we should have paid attention to parallelism after all. The Sphinx implementation took those 25 machines down to 10, with plenty of room to grow. While Sphinx didn't handle the traffic out of the box at the time, Zawodny was able to patch it to handle Craigslist's specific need - and fix a few bugs along the way.

Secure remote control for conventional and virtual desktops

More from The Register

next story
'Windows 9' LEAK: Microsoft's playing catchup with Linux
Multiple desktops and live tiles in restored Start button star in new vids
Not appy with your Chromebook? Well now it can run Android apps
Google offers beta of tricky OS-inside-OS tech
New 'Cosmos' browser surfs the net by TXT alone
No data plan? No WiFi? No worries ... except sluggish download speed
Greater dev access to iOS 8 will put us AT RISK from HACKERS
Knocking holes in Apple's walled garden could backfire, says securo-chap
NHS grows a NoSQL backbone and rips out its Oracle Spine
Open source? In the government? Ha ha! What, wait ...?
Google extends app refund window to two hours
You now have 120 minutes to finish that game instead of 15
Intel: Hey, enterprises, drop everything and DO HADOOP
Big Data analytics projected to run on more servers than any other app
prev story

Whitepapers

Secure remote control for conventional and virtual desktops
Balancing user privacy and privileged access, in accordance with compliance frameworks and legislation. Evaluating any potential remote control choice.
Saudi Petroleum chooses Tegile storage solution
A storage solution that addresses company growth and performance for business-critical applications of caseware archive and search along with other key operational systems.
High Performance for All
While HPC is not new, it has traditionally been seen as a specialist area – is it now geared up to meet more mainstream requirements?
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.
Providing a secure and efficient Helpdesk
A single remote control platform for user support is be key to providing an efficient helpdesk. Retain full control over the way in which screen and keystroke data is transmitted.