How to build your own Watson Jeopardy! supermachine

To rule humanity, download the following open source code...

Ask mom for her credit card...

However, unless you're a geek sheik or a billionaire bit-basher, you're obviously not going to buy all this iron. But you could use your mom's credit card if you're working from your basement bedroom – or your wife's if you're working from your man cave in the garage – to reserve some server instances on Amazon's EC2 compute cloud.

We would go for the Cluster Compute Instances (CCIs) that Amazon announced last July. Each CCI delivers 33.5 EC2 compute units of power, running in 64-bit mode, and presents 23GB of virtual memory to the operating system (that's not much). The physical hardware underneath each CCI slice is a two-socket x64 server based on Intel's quad-core 2.93GHz Xeon X5570 processors.

That means each slice has eight cores, 16 threads, and 23GB of memory, and the nodes are interconnected with 10 Gigabit Ethernet switches. To match the core count of the Power-based Watson machine, you'd need 360 of these slices. To match the thread count, you'd need 720 slices. And to match the aggregate main memory, you'd need 712 boxes. So it looks like 720 boxes will do the trick, provided that the overhead of Amazon's Xen-based EC2 hypervisor is not too high. At $1.60 per hour per CCI slice, you are in for $1,152 per hour. Trust me, your ma or your wife won't mind. It's all for the benefit of science.
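
If you fancy sanity-checking that back-of-the-envelope math before begging for the plastic, the arithmetic fits in a few lines of Java. The Watson-side totals (2,880 cores, 11,520 threads, about 16TB of memory) are simply derived from the slice counts above:

// BackOfEnvelope.java - the slice arithmetic from the text, nothing more.
public class BackOfEnvelope {
    public static void main(String[] args) {
        final int coresPerSlice   = 8;
        final int threadsPerSlice = 16;
        final int gbPerSlice      = 23;
        final double dollarsPerSliceHour = 1.60;

        // Watson-side totals, derived from the slice counts in the article
        final int watsonCores   = 360 * coresPerSlice;   // 2,880 cores
        final int watsonThreads = 720 * threadsPerSlice; // 11,520 threads
        final int watsonMemGB   = 712 * gbPerSlice;      // 16,376GB, ~16TB

        // Slices needed to match each resource, rounded up
        int forCores   = ceilDiv(watsonCores,   coresPerSlice);   // 360
        int forThreads = ceilDiv(watsonThreads, threadsPerSlice); // 720
        int forMemory  = ceilDiv(watsonMemGB,   gbPerSlice);      // 712

        int slices = Math.max(forCores, Math.max(forThreads, forMemory)); // 720
        System.out.printf("Slices needed: %d%n", slices);
        System.out.printf("Hourly bill:   $%,.2f%n", slices * dollarsPerSliceHour);
    }

    private static int ceilDiv(int a, int b) { return (a + b - 1) / b; }
}

Run it and it confirms the damage: 720 slices, $1,152.00 an hour.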

The thing that makes Watson a question-and-answer machine – and not just a cluster running Linux – is a mountain of code that IBM has developed called DeepQA. IBM says precious little about the DeepQA stack in public, but two of its key elements are open source programs available through the Apache Software Foundation.

The first is Apache Hadoop, the open source distributed data-crunching system created by Doug Cutting after he read Google's papers describing its MapReduce and Google File System back-end infrastructure. Hadoop was spun out of the Apache Nutch web-crawler project in 2006 and was a workable system by around 2008 or so.
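
To give you a taste of the programming model – and to be clear, this is the canonical word-count example from the Hadoop documentation, not a line of DeepQA – a Hadoop job boils down to a mapper, a reducer, and a driver that wires them together:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in this node's slice of the input
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the per-word counts shuffled in from all the mappers
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same mapper-and-reducer shape scales from counting words to pre-chewing a couple of hundred million pages of reference text before game day.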

The other key piece of code in the DeepQA stack is Apache UIMA – the Unstructured Information Management Architecture – an information-management framework created by IBM's database gurus back in 2005 to help them cope with unstructured information such as text, audio, and video streams. The UIMA code performs the natural-language processing (NLP is the term of art in AI) that parses text and helps Watson figure out what a Jeopardy! clue is actually asking.
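
Here's the shape of the plumbing, if not the brains: a toy UIMA annotator that marks every capitalised word in a clue as a span of interest. It leans on UIMA's built-in Annotation type so you can try it without generating a custom type system, and it is emphatically not how DeepQA's annotators work – it just shows what one looks like:

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// A deliberately naive annotator: flag every capitalised word in the
// document text as a span of interest and index it in the CAS.
public class NaiveEntitySpotter extends JCasAnnotator_ImplBase {

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        String text = jcas.getDocumentText();
        if (text == null) return;

        int i = 0;
        while (i < text.length()) {
            // skip to the start of the next word
            while (i < text.length() && !Character.isLetter(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) i++;
            // capitalised words get an annotation in the CAS index
            if (start < i && Character.isUpperCase(text.charAt(start))) {
                new Annotation(jcas, start, i).addToIndexes();
            }
        }
    }
}

Wrap that in an analysis engine descriptor, feed it a clue, and the CAS comes back with the capitalised spans indexed. DeepQA chains its hundreds of algorithms through exactly this kind of annotator pipeline.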

IBM has embedded UIMA functions in various systems programs it sells, the first being the OmniFind semantic search engine that Big Blue put into its DB2 data warehouses. IBM has proposed UIMA as an OASIS standard, and took it open source to get people on board with its way of creating frameworks for managing unstructured data. UIMA ships frameworks for Java and C++, but could no doubt be extended to whatever language you want to code your Watson QA machine in.

David Gondek, one of the researchers on IBM's DeepQA team, tells El Reg that IBM used Prolog for question analysis. Some Watson algorithms are written in C or C++, particularly where processing speed is important, but Gondek says that most of the hundreds of algorithms that do question analysis, passage scoring, and confidence estimation are written in Java. So maybe you want to use a RHEL-JBoss stack for your Watson.
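
Confidence estimation, at its simplest, means merging a pile of per-candidate evidence scores into one number you'd wager fake money on. The sketch below is not IBM's algorithm – the feature names and weights are invented for illustration, and DeepQA learns its weights from thousands of historical clues – but it shows the general idea of a learned, weighted merge:

import java.util.LinkedHashMap;
import java.util.Map;

public class ConfidenceMerger {

    // Hypothetical feature weights. In DeepQA these are learned from
    // training data, not hard-coded; the numbers here are made up.
    private static final double W_PASSAGE = 2.1;  // passage-scoring evidence
    private static final double W_TYPE    = 1.4;  // answer-type match
    private static final double W_POP     = 0.3;  // source reliability
    private static final double BIAS      = -2.0;

    // Logistic merge of the feature scores into a 0..1 confidence
    static double confidence(double passage, double typeMatch, double reliability) {
        double z = BIAS + W_PASSAGE * passage + W_TYPE * typeMatch + W_POP * reliability;
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // candidate answer -> {passage score, type match, source reliability}
        Map<String, double[]> candidates = new LinkedHashMap<>();
        candidates.put("Toronto", new double[]{0.40, 0.20, 0.90});
        candidates.put("Chicago", new double[]{0.85, 0.95, 0.80});

        for (Map.Entry<String, double[]> e : candidates.entrySet()) {
            double[] f = e.getValue();
            System.out.printf("%-8s confidence: %.3f%n",
                    e.getKey(), confidence(f[0], f[1], f[2]));
        }
    }
}

A candidate with strong passage evidence and the right answer type wins comfortably; one with neither scores low enough that a sensible machine keeps its buzzer thumb in its pocket.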

Now here is the real problem with a DIY Watson: the algorithms that IBM's DeepQA team created to teach Watson how to play Jeopardy! consist of about a million lines of code. That's going to take you and your friends a bit more than a few weekends to create. But, if you do it, you can launch a deep analytics startup and sell it to HP or Microsoft for ba-zillions.

Let me offer you a few pointers from Gondek for when you build your machine. First, don't stuff it full of everything you can find on the internet. In creating Watson, IBM's researchers figured out that authoritative texts such as the Oxford English Dictionary, Bartlett's Familiar Quotations, Wikipedia – yes, Wikipedia – and various encyclopedias were the data sets best suited to playing Jeopardy!. You want precise data, to be sure, but you don't want to surround it with so much extraneous text that the machine ends up churning through tons o' text to find an answer.

For example, you don't put in Moby Dick itself, but rather lots of authoritative texts that talk about Moby Dick, and you pull out the important passages. As it turns out, Watson needed about 200 million pages of text – roughly the equivalent of one million books – to play Jeopardy!.

The other key insight Gondek offers is to focus hard on the question-parsing algorithms. By finding the key words in any sentence and dispensing with the noise, you not only get to the answer faster, but also do a better job of coming up with the correct one.
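
Even a crude stop-word filter makes the point about noise. This is nothing like DeepQA's Prolog-driven question analysis – no parse trees, no grammar – but it shows how much of a typical clue is chaff:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClueKeywords {

    // A tiny, illustrative stop list; a real one runs to hundreds of words
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "this", "that", "of", "in", "on", "for", "to",
            "is", "was", "were", "it", "its", "and", "or", "by", "with", "as"));

    // Lower-case the clue, split on non-alphanumerics, drop the stop words
    static List<String> keywords(String clue) {
        List<String> kept = new ArrayList<>();
        for (String token : clue.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        String clue = "This 19th-century novel opens with the line Call me Ishmael";
        System.out.println(keywords(clue));
        // prints: [19th, century, novel, opens, line, call, me, ishmael]
    }
}

Three of the eleven words in that clue were noise; what's left is what you throw at the corpus.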

These two insights are what turned Watson from a crap Jeopardy! player into a champion. Good luck building your own. And dominating the world. ®
