Original URL: http://www.theregister.co.uk/2011/02/21/ibm_watson_qa_system/

How to build your own Watson Jeopardy! supermachine

To rule humanity, download the following open source code...

By Timothy Prickett Morgan

Posted in HPC, 21st February 2011 03:00 GMT

If you don't want your own Watson question-and-answer machine after watching the supercomputer whup the human race on Jeopardy! last week, you must be a lawyer. Only lawyers think they already have all the answers.

But if you grew up watching Robby the Robot in Lost in Space, HAL in 2001: A Space Odyssey, the unnamed but certainly capitalized Computer in Star Trek, R2D2 in Star Wars, Ahh-nold in The Terminator, and Number Six in Battlestar Galactica – we'll stop now – you desperately want a Watson: something that can answer all your questions and maybe even rule the world. So why not build your own Watson-style QA machine?

As it turns out, the basic foundations are there for the taking.

Let's start with the iron – it really isn't that much hardware, after all. With the beta version of the Watson software, IBM started out with a few racks of its BlueGene/P parallel supercomputers, a grandson of the Deep Blue RS/6000 SP PowerParallel machine that played a chess match against Garry Kasparov – and beat him – back in 1997. But because the Watson effort was not just a technical challenge, but also a killer marketing campaign for the current Power7-based Power Systems lineup, Big Blue eventually switched the Watson DeepQA software stack to a cluster of Power 750 midrange servers.

To have enough memory and bandwidth to store all the necessary data, IBM put 90 of these Power 750 servers into ten server racks. Each server is configured with four of IBM's eight-core Power7 chips running at 3.55GHz. That gives Watson 2,880 cores and 11,520 threads on which to run its software stack. If the DeepQA software is thread-heavy – and there's every reason to believe it is – you'll need iron with lots of threads.

The 90 servers underpinning the Watson machine had a combined 16TB of main memory, but it looks like that was not evenly distributed across the nodes. The math works out to roughly 182GB per machine, which is a silly, non-base-2 number.
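The arithmetic is easy to check. Here's a quick sketch using only the figures quoted above – the four-threads-per-core figure is Power7's SMT4 mode, which is how 2,880 cores becomes 11,520 threads:

```python
# Sanity-check the Watson cluster arithmetic from the figures above.
servers = 90          # Power 750 nodes across ten racks
sockets = 4           # Power7 chips per server
cores_per_chip = 8    # eight-core Power7
threads_per_core = 4  # Power7 in SMT4 mode

cores = servers * sockets * cores_per_chip
threads = cores * threads_per_core
print(cores)    # 2880
print(threads)  # 11520

# 16TB spread evenly over 90 nodes gives that silly, non-base-2 number
total_memory_gb = 16 * 1024
print(round(total_memory_gb / servers, 1))  # 182.0
```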

David Gondek – who was on both the system strategy and algorithms teams behind the "Blue J" project, as Watson was known internally – tells El Reg that the DeepQA system creates an in-memory database of the information that's pumped into the system. The machines are networked together, obviously, but being a software guy, Gondek didn't know what network IBM used. I would guess 40Gb/sec InfiniBand or 10 Gigabit Ethernet with Remote Direct Memory Access (RDMA) support to speed up the communication between nodes. Gondek said that the data that's put in memory and on disk is replicated and distributed around the system for both speed and high availability.
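IBM hasn't published how DeepQA actually lays its data out, so take this as a sketch of the general idea Gondek describes – deterministic placement with replication across nodes – and not Big Blue's real scheme. The node names and the replication factor of three are our assumptions:

```python
import hashlib

# Toy sketch of replicated data placement across a 90-node cluster.
# Illustrates the idea Gondek describes (replication for speed and
# availability); the scheme itself is assumed, not IBM's.
NODES = [f"node{i:02d}" for i in range(90)]  # 90 Power 750 servers
REPLICAS = 3                                 # assumed replication factor

def placement(key, nodes=NODES, replicas=REPLICAS):
    """Deterministically pick `replicas` distinct nodes for a key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    start = h % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# Every node can recompute where a document lives with no lookup table;
# reads can go to whichever replica answers fastest, and losing one
# node loses no data.
print(placement("Moby Dick"))
```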

The Watson box has 4TB of data capacity, which is not all that much, really. IBM did not say if it was disk drives or flash, but if most of the data used by Watson is stored in main memory, there is no reason to use the more expensive flash technology. But what the heck. Let's use flash anyway so it doesn't run so hot.

Because Linux is the fastest operating system on IBM's Power platforms (at least according to the SPEC family of benchmarks), Big Blue chose a variant of Linux to run on the Power 750 nodes – in this case, Novell's SUSE Linux Enterprise Server 11. SLES has a lot of tuning for HPC workloads and dominates supercomputing, although Red Hat is getting some traction in HPC now that Novell's fate has been uncertain for the past year or so: SGI, for instance, has certified RHEL 6 and Windows Server 2008 alongside SLES 11 on its latest massively parallel boxes, where it formerly did only SLES.

Ask mom for her credit card...

However, unless you're a geek sheik or a billionaire bit-basher, you're obviously not going to buy all this iron. But you could use your mom's credit card if you're working from your basement bedroom – or your wife's if you're working from your man cave in the garage – to reserve some server instances on Amazon's EC2 compute cloud.

We would go for the Cluster Compute Instances that Amazon announced last July. Each Cluster Compute Instance delivers 33.5 EC2 compute units of power, running in 64-bit mode, and presents 23GB of memory to the operating system (that's not much). Underneath each CCI slice is a two-socket x64 server based on Intel's 2.93GHz Xeon X5570 processors.

That means each slice has 8 cores, 16 threads, and 23GB of memory. The nodes are interconnected with 10 Gigabit Ethernet switches. To match the core count of the Power-based Watson machine, you'd need 360 of these slices. To match the thread count, you'd need 720 slices. And to match the aggregate main memory, you'd need just over 712 boxes. So it looks like 720 boxes will do the trick, provided that the overhead of the Xen-based Amazon EC2 hypervisor is not too high. At $1.60 per hour for the CCI slices, you are in for $1,152 per hour. Trust me, your ma or your wife won't mind. It's all for the benefit of science.
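Sizing it out in a few lines – the figures are the ones above, and the rounding up is ours:

```python
import math

# How many EC2 Cluster Compute Instances to match Watson's iron,
# using the article's figures and 2011 pricing.
watson_cores, watson_threads = 2880, 11520
watson_memory_gb = 16 * 1024

slice_cores, slice_threads, slice_gb = 8, 16, 23
price_per_slice_hour = 1.60

by_cores = math.ceil(watson_cores / slice_cores)        # 360
by_threads = math.ceil(watson_threads / slice_threads)  # 720
by_memory = math.ceil(watson_memory_gb / slice_gb)      # 713

# The thread count dominates, so 720 slices covers all three.
slices = max(by_cores, by_threads, by_memory)
print(slices, f"${slices * price_per_slice_hour:,.2f}/hour")
```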

The thing that makes Watson a question-and-answer machine and not just a cluster running Linux is a mountain of code that IBM has developed called DeepQA. You can see what little IBM has to say about the DeepQA stack here. Two key elements of the DeepQA stack are open source programs available through the Apache Software Foundation.

The first is Apache Hadoop, the open source distributed data-crunching system created by Doug Cutting after he read about Google's back-end infrastructure. Hadoop grew out of the Nutch web-crawling project, was spun into its own Apache project in 2006, and was a workable system by around 2008 or so.

The other key piece of code in the DeepQA stack that Watson ran is Apache UIMA – Unstructured Information Management Architecture – which is an information-management framework created by IBM database gurus back in 2005 to help them cope with unstructured information such as text, audio, and video streams. The UIMA code performs the natural-language processing (NLP is the term of art in AI) that parses text and helps Watson figure out what a Jeopardy! clue is about.

IBM has embedded UIMA functions in various systems programs it sells, the first being the OmniFind semantic search engine that Big Blue put into its DB2 data warehouses. IBM has proposed UIMA as an OASIS standard, and took it open source to get people on board with its way of creating frameworks for managing unstructured data. UIMA has frameworks for Java and C++, but could no doubt be extended to whatever language you wanted to code your Watson QA machine in.

Gondek tells El Reg that IBM used Prolog to do question analysis. Some Watson algorithms are written in C or C++, particularly where the speed of the processing is important. But Gondek says that most of the hundreds of algorithms that do question analysis, passage scoring, and confidence estimation are written in Java. So maybe you want to use a RHEL-JBoss stack for your Watson.

Now here is the real problem with a DIY Watson: the algorithms that IBM's DeepQA team created to teach Watson how to play Jeopardy! consist of about a million lines of code. That's going to take you and your friends a bit more than a few weekends to create. But, if you do it, you can launch a deep analytics startup and sell it to HP or Microsoft for ba-zillions.

Let me offer you a few pointers from Gondek for when you build your machine. First, don't stuff it full of anything you can find on the Internet. In creating Watson, IBM's researchers figured out that authoritative texts like the Oxford English Dictionary, Bartlett's Familiar Quotations, Wikipedia – yes, Wikipedia – and various encyclopedias were the data sets best suited to playing Jeopardy!. You want precise data, to be sure, but you don't want to surround it with so much extraneous text that the machine will be churning through tons o' text to find an answer.

For example, you don't put in Moby Dick itself; instead, you put in lots of authoritative texts that talk about Moby Dick, and you pull out the important passages. As it turns out, Watson needed about 200 million pages of text, or about the equivalent of 1 million books, to play Jeopardy!.

The other key insight that Gondek offers is to really focus on the question-parsing algorithms. By finding out what the key words are in any sentence and dispensing with the noise, you can not only get to the answer faster, but do a better job of coming up with the correct answer.
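As a crude illustration of what "dispensing with the noise" means – nothing like Watson's actual Prolog-and-Java question analysis, just a hypothetical stopword filter run on a made-up clue:

```python
# Toy question-parsing sketch: drop the noise words from a clue and
# keep the terms that actually constrain the answer. Watson's real
# analysis builds parse trees, finds the clue's focus, and types the
# answer; this is only the flavour of the idea.
STOPWORDS = {"this", "the", "a", "an", "of", "in", "on", "for",
             "is", "was", "it", "to", "and", "by"}

def key_terms(clue):
    words = [w.strip(".,!?'\"").lower() for w in clue.split()]
    return [w for w in words if w and w not in STOPWORDS]

clue = "This 1851 novel by Herman Melville opens with the line 'Call me Ishmael'"
print(key_terms(clue))
# -> ['1851', 'novel', 'herman', 'melville', 'opens', 'with',
#     'line', 'call', 'me', 'ishmael']
```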

These two insights are what turned Watson from a crap Jeopardy! player into a champion. Good luck building your own. And dominating the world. ®