Codebreaker II Our winner reveals all
Mark Wutka explains his attack
In response to readers' requests for our Codebreaker II supremo, Mark Wutka, to explain how he attacked the problem, we are pleased to print the following from the man himself. Enjoy:
When I first saw the Codebreaker II Challenge, I did what any cryptanalyst would do with an unknown cipher - I counted the occurrences of each letter. In normal English text, the letters ETAONIRSH occur most frequently, with E appearing about 13% of the time. You also expect letters like J, Q, X and Z (the high-point Scrabble letters) to appear very infrequently, if at all.
Because Codebreaker II was 587 letters long, you might expect to see some of the low-frequency letters appear once or twice. When I counted the letters, however, I saw the following numbers: Q:28 Y:28 G:28 R:28 U:28 T:27 A:27 I:25 H:25 C:23 F:23 V:22 X:22 W:22 P:21 E:21 S:21 L:21 N:20 Z:20 O:20 B:19 K:19 J:18 D:17 M:13. Notice that all 26 letters appear, and that the lowest frequency is roughly half the highest one. The highest value should have been around 76 (13% of 587). Obviously, this was not a simple substitution.
Instead, it was likely to be a polyalphabetic cipher where a letter can represent different letters at different times. In conversations on the rec.puzzles newsgroup, Jim Gillogly suggested several possibilities and ruled out others. For example, many polyalphabetic ciphers use a 5x5 square to encrypt letters, using the same letter to encrypt both I and J. In this cipher, however, all 26 letters appear, so it couldn't be any of those (like Playfair, for instance).
Jim suggested something in the Vigenere family of ciphers, which use a set of 26 different alphabets, or possibly a mechanical encryption like Enigma. Enigma was an obvious choice since The Register is in the U.K. and Britain is very proud (rightly so) of their work on breaking the Enigma during World War II. At one point, I had been very close to the correct settings. I guessed that the set of rotors might be a permutation of the submission date (04/05/01) and that the initial position for the rings might be REG, but I never would have guessed the 23/06/12 (Alan Turing's birthday) for the ring setting.
It was at this point that Jim Gillogly was too smart for his own good. He reasoned that if this was an Enigma encryption, then the letters ETAONIRSH should appear less frequently than the others because the Enigma never encrypts a letter as itself. That is, if you type an E into an Enigma machine, you'll get anything BUT an E as the output. Jim performed tests on several English texts to verify his hypothesis, and it looked correct. We never suspected that the text might have been encrypted some other way first. Jim is an expert at classical cryptography and has written a program that is quite good at cracking Enigma cipher.
After the smoke cleared, he ran his program on the Codebreaker II and found that it did, in fact, figure out the correct settings, including the plugboard. He also wrote a paper describing his Enigma cracking technique, which is available here. After incorrectly ruling out Enigma, we floundered around with variations of Vigenere ciphers, but obviously never got anywhere. We finally got discouraged and realized that we desperately needed a clue.
When clue time arrived, I immediately recognized the clues as Enigma settings. I didn't know much about the internals of the Enigma when the contest started but by the time the clue arrived, I had read several papers describing its operation. I grabbed an Enigma simulator off the net and decrypted the text. The SO FAR SO GOOD at the end was the only clue that I had done it right, although the supplementary encrypted text also told me that I had the right settings.
Now that I had this newly decrypted text, I went back to the beginning and did a frequency count again. This time, it was much more in line with a simple substitution cipher: U:77 L:57 F:51 D:51 S:50 M:37 K:37 N:36 J:27 O:24 W:21 B:21 X:18 T:17 P:14 Q:11 C:10 E:9 A:5 G:5 R:5 H:2 V:1. Notice that U appears 77 times, just a hair over 13% of the time. I assumed that this was English text and substituted E for U and T for L. I also employed another technique which is to count the number of times various letters contact each other. In the case of L (which I assumed was T), I noticed that it came before M 19 times. I figured that M must represent H, because H frequently appears after T.
At this point, I noticed several occurrences of TH T (where the blank was an S in the ciphertext, making me think that S must represent A. By this time, I was beginning to think that I was wrong about M representing H, because I saw HH appear several times, and you don't normally see that in English text. I did notice, however, that HATH occurred several times, so I got the impression that this was old English.
Next, I looked for the pattern T THE, expecting TO THE to occur somewhere, but I didn't find it. Going back to the contact frequencies, I noticed that J appeared after U 9 times - the most frequent letter to appear after U. In English, the letter that most frequently comes after E is R, so I guessed that J represented R. Now, looking at the beginning of the cryptogram, I had A TER, with a T representing the blank spot. I guessed that T represented F. Next, I looked for the F R pattern, figuring that FOR must appear somewhere. I saw several places, each with F representing the blank spot. I guessed that F must represent O.
At this point, I noticed a piece of a phrase - THATTHEROOTEOFA - that really made me realize I was on the right track. First, I began to suspect that this was something like Chaucer, because of "roote", which you may remember if you ever had to memorize the opening to the Canterbury Tales, that included the line "The Droute of March hath perced to the roote". Since the next two letters after A were the same letter (B), I guessed that B was probably L. I also noticed FOR OOTH, which certainly looked like it should be FORSOOTH, so I substituted S for K. The introduction of S into the text also brought about another piece of phrase that looked solveable - THATHESO LEHATHLOSTTHE. Since ROOT was spelled ROOTE, I thought maybe SOUL was spelled SOULE and substituted U for the P that represented the blank spot.
Now, all is fair in love, war, and Codebreaker challenges, so at this point, I decided to bring the Internet to bear on the problem. I went to www.altavista.com and typed +"that the soule hath lost the" (Actually, I tried Google first, but because it always breaks up the words, it didn't find anything that looked like a match). Altavista gave me three results, all containing Chaucer's "The Parson's Tale". After that, it was a simple matter of completing the substitutions.
Had I not taken that shortcut, my probable next step would have been to substitute ND for DW, since I saw FOU and would assume that it was probably FOUND. You would then be able to solve the rest quite easily.
This was a fun and challenging puzzle and I look forward to see what torture Lester dreams up for Codebreaker III!
If you are interested in cryptograms of all sorts, you might consider joining the American Cryptogram Association. Although the name says "American", they welcome members from all nationalities (although most of their cryptograms are in English).
I'm the author of Special Edition Using Java Server Pages and Servlets, and also Special Edition Using Java 2 Enterprise Edition. I can be reached here.
Thanks to Mark for sharing that with us. You can find the Codebreaker II solution here.
Sponsored: RAID: End of an era?