Sphinx - text search The Pirate Bay way

Like MySQL. But it can scale

In 2001, Andrew Aksyonoff had an itch. He ran a website that indexed song lyrics and allowed users to search them. At the time, none of the open source options for text search worked. Lucene, the standard response at an engineering team meeting when the problem of search comes up, was still a baby. Indexing took too long, and it was a bear to install on anything other than a Lucene developer's machine.

Mnogosearch - the search engine you probably still haven't heard of - was too bloated and took a full twenty-four hours to index Aksyonoff's meager 150 megabytes of data. On top of all this, the standard algorithms for text search were about as useful as a venereal disease when it came to searching song lyrics. Andrew had a different idea of how text search should work - and the spare time to try it out. He did, and today, his software powers the search engines behind Craigslist, The Pirate Bay, and many more.

It's called Sphinx, and it's on track to become the open source world's canonical answer to the question of text search. MySQL and Solr, the two popular solutions, are showing their age. MySQL introduced full-text search in late 2000 as a way to more intelligently search blobs of text stored in databases. You can work a full-text clause into a query, and MySQL will rank the result rows by how relevant it thinks they are to the query.
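To make the "full-text clause" concrete, here is a minimal sketch of what such a query might look like when assembled from Python. The `MATCH() ... AGAINST()` syntax is MySQL's own; the table name `lyrics`, the columns `id`, `title` and `body`, and the helper `build_fulltext_query` are hypothetical names invented for illustration, and the sketch assumes a FULLTEXT index already exists on the listed columns.

```python
def build_fulltext_query(table, columns, limit=10):
    """Build a MySQL full-text query string ranked by relevance.

    Assumptions (not from the article): the table has an `id` column and a
    FULLTEXT index covering exactly `columns`. Table and column names are
    interpolated directly, so they must come from trusted code, never from
    user input. The search phrase itself is left as a %s placeholder, the
    parameter style used by the common Python MySQL drivers.
    """
    cols = ", ".join(columns)
    match = f"MATCH({cols}) AGAINST (%s)"
    return (
        f"SELECT id, {match} AS score "
        f"FROM {table} "
        f"WHERE {match} "
        f"ORDER BY score DESC "
        f"LIMIT {int(limit)}"
    )


# Hypothetical usage: the phrase is passed twice, once per placeholder.
sql = build_fulltext_query("lyrics", ["title", "body"], limit=5)
params = ("blue suede shoes", "blue suede shoes")
print(sql)
```

With a real connection, `sql` and `params` would be handed to a driver's `cursor.execute(sql, params)`. The `score` column is MySQL's own relevance number, which is precisely the part the article says you get little control over.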

MySQL uses textbook search algorithms and doesn't allow for a lot of relevance tuning. It's like a drawing from a five-year-old: the heart is in the right place, but everybody knows that kids suck at drawing. Implementation details aside, MySQL still suffers from scalability problems. Its developers ignored the chip manufacturers' move to multi-core CPUs, hoping that this unpleasant trend, which required them to actually think about multi-threading, would just blow over sooner or later. As a result, MySQL's ability to handle parallelism is, well, see the five-year-old's drawing.

Internet-famous MySQL wonk Jeremy Zawodny, who had the foresight to jump from the ship's bow as Yahoo started to take on water, replaced MySQL full-text search at Craigslist with Sphinx. Craigslist used 25 machines to handle roughly 50 million queries per day on MySQL. Under that kind of load, Zawodny found that MySQL wasn't using much CPU or doing much disk I/O, which meant it was spending all of its time waiting on thread locks. Oops.

Maybe we should have paid attention to parallelism after all. The Sphinx implementation took those 25 machines down to 10, with plenty of room to grow. While Sphinx didn't handle the traffic out of the box at the time, Zawodny was able to patch it to handle Craigslist's specific needs - and fix a few bugs along the way.
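For a rough sense of scale, the figures quoted above can be turned into per-machine averages. This is back-of-the-envelope arithmetic only: it assumes the same roughly 50 million queries per day before and after the switch, spread evenly across machines and across the day, none of which the article states. Real traffic has peaks, so these are averages, not capacity figures.

```python
QUERIES_PER_DAY = 50_000_000      # "roughly 50 million", per the article
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400


def avg_qps_per_machine(machines):
    """Average queries per second per machine, assuming an even spread."""
    return QUERIES_PER_DAY / machines / SECONDS_PER_DAY


mysql_qps = avg_qps_per_machine(25)   # MySQL cluster size from the article
sphinx_qps = avg_qps_per_machine(10)  # Sphinx cluster size from the article

print(f"MySQL:  about {mysql_qps:.0f} queries/sec per machine")
print(f"Sphinx: about {sphinx_qps:.0f} queries/sec per machine")
print(f"Each Sphinx box carries {sphinx_qps / mysql_qps:.1f}x the load")
```

Under those assumptions each Sphinx machine absorbs two and a half times the average load of a MySQL machine, and that is before counting the "plenty of room to grow".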
