Original URL: http://www.theregister.co.uk/2008/10/31/bi_sarah_palin/

Sarah Palin's words get data mined

Whoah, there's data here? Business analysis gets to work on VP transcripts

By Mark Whitehorn

Posted in Bootnotes, 31st October 2008 12:19 GMT

USA '08 Business Intelligence (BI) is about extracting information from data. The name implies that it is only applicable to business information, but that’s misleading. Given the right techniques, information can be found in the most unexpected places - even in the speeches of vice-presidential candidates.

Sarah Palin’s meteoric rise to fame as the Republican vice-presidential nominee is well documented. From the relatively obscure position of governor of Alaska she was unexpectedly catapulted into the spotlight as John McCain’s running mate in the US presidential elections.

Even before the twang of the ballista had faded, concerns began to be expressed (mainly, it has to be said, by Democrats) about her ability to govern in the event that she becomes president.

Some of these concerns have focused on her apparent lack of knowledge of important Republican tenets, such as the Bush doctrine.

Others point to her verbal abilities, which have been mercilessly parodied by the media.

It has been argued that mercy is inappropriate since some of the most damning parodies have used the governor’s own words in their original context. This video shows both the parody and the original:

Part of the trouble appears to be that as governor of Alaska she developed a homely, cod-parochial way of talkin’ that is really meant to, you know, appeal to Joe Sixpack and all the fiercely protective lipstick-pitbull hockey moms out there. It has been further suggested that the Republican party, mindful of her growing reputation for only opening her mouth to change feet, has been coaching her in an attempt to replace ‘Palin-speak’ with the more normal ‘political-speak’.

This is not to imply the introduction of meaningful content - simply the replacement of down-home references (‘good guy’, ‘Alaska’) with politically charged words that imply an understanding of the world outside her home state (‘Afghanistan’, ‘economy’).

One way to test this is, of course, to apply BI techniques to, say, a pair of transcripts – one from early in the campaign and one from later. The two that spring to mind are her interview with Katie Couric, followed by the Vice-Presidential Debate. And this is precisely what has been done.

It is important before revealing the results to point out that this is not science. There are not enough words to provide a statistically significant result, the context for the transcripts was very different, the subjects discussed were clearly not the same, and on and on and on. So, there is no science here. None at all. But hey, it does allow us to, you know, have a laugh an’ all, so we say, what the heck.

The exercise was performed using Microsoft’s BI tools by a Microsoft employee at Redmond and the results were kindly made available to The Register. The transcripts are freely available and many of you have access to analysis tools, so why not have a go?

The transcripts were passed through a process that not only counts the number of times that a word is used but also assigns a ‘tf-idf’ weighting (term frequency–inverse document frequency) which gives some indication of the importance of the words in the document.
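For readers who want to have a go without the Microsoft tools, here is a minimal Python sketch of the counting and weighting step. It is not the method used in the original analysis: the tokenizer, the smoothed idf formula (a common variant, chosen here because a plain log(N/df) gives zero weight to any word that appears in both of only two documents) and the placeholder snippets are all assumptions made for illustration. It also scores single words, whereas the analysis evidently picked out phrases such as "good guy".

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = count of term in the document / total terms in the document
    idf = ln((1 + N) / (1 + df)) + 1   (smoothed variant; an assumption)
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


def ranked_terms(weights):
    """Sort terms by descending weight; ties are broken alphabetically."""
    return sorted(weights, key=lambda term: (-weights[term], term))


# Placeholder snippets, NOT quotations from the real transcripts.
interview = "good guy good guy alaska freedom"
debate = "economy tax economy alaska war"

interview_weights, debate_weights = tf_idf([interview, debate])
print(ranked_terms(interview_weights))
print(ranked_terms(debate_weights))
```

A term's position in the ranked list is the kind of figure reported in the tables: a word used often in one transcript but absent from the other floats to the top, while a word common to both sinks.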

The (totally and absolutely not scientific) results are fascinating. Top of the list of the early Palin transcript is the term “good guy” with “bad guy” in 26th position. In the later debate she doesn’t use these terms at all. In the interests of fairness it is worth pointing out that Joe Biden (the opposition Vice Presidential candidate) uses the term “bad guy” once in the debate, so perhaps he needs some coaching too.

If we try to look for warm fuzzy patriotic words and compare their position in the early interview and then in the later debate, we find:

Word                Position in interview   Position in debate
Good guy            1                       Never appears
Alaska              13                      40
Freedom             15                      80
Democratic value    16                      Never appears
Face                22                      170
Bad guy             26                      Never appears

Now suppose we look for words that might be considered to be more presidential – words which give the impression of a potential world leader:

Word           Position in interview   Position in debate
Afghanistan    125                     2
People         40                      4
Economy        Never appears           8
Iraq           29                      9
Job            144                     12
Tax            Never appears           13
War            68                      17
Government     166                     21
Nation         Never appears           25

So have we proved our premise? Given that we have already hammered home the lack of science here, we leave it to you, gentle reader, to decide if there is enough ‘evidence’ here.

Of course, if we were being fair, we would (as the original analysis did) take a look at the comparable Joe Biden transcripts. But we aren’t trying to be fair or unfair. We aren’t trying to score any particular political points; we are trying to show that BI techniques can be applied to any data, not just business data. Which brings us to you.

If you can answer “yes” to these three questions, then what are you waiting for? Hurry to the BI bonanza and get mining.

However, although BI is broadly applicable, there are data sets to which it would be entirely inappropriate to apply these techniques - for example, the work of fine, upstanding journalists such as those employed at Vulture Central. For reasons that are too technical to go into here, this data is not amenable to analysis of this kind. Which is a shame, because we know that we are always consistent. ®