Automated Solution of Simple Substitution Ciphers
About a year ago, I was working on the solutions to the cryptograms in the first chapter of the Mathematical Cryptography book, and the idea struck me that it would be nice to have a computer program that could automatically solve such puzzles. It turns out that this is easier said than done, especially if the spaces between words are not preserved in the ciphertext and the puzzle author took pains to manipulate the character frequencies of the underlying message.
Exercise 1.4 (c) from the book is a good example. I urge you to try to solve it by hand in order to get an appreciation for the nature of the problem. It is not easy! Then think about how you would teach a computer to perform this feat automatically.
GSZES GNUBE SZGUG SNKGX CSUUE QNZOQ EOVJN VXKNG XGAHS
AWSZZ BOVUE SIXCQ NQESX NGEUG AHZQA QHNSP CIPQA OIDLV
JXGAK CGJCG SASUB FVQAV CIAWN VWOVP SNSXV JGPCV NODIX
GJQAE VOOXC SXXCG OGOVA XGNVU BAVKX QZVQD LVJXQ EXCQO
VKCQG AMVAX VWXCG OOBOX VZCSO SPPSN VAXUB DVVAX QJQAJ
VSUXC SXXCV OVJCS NSJXV NOJQA MVBSZ VOOSH VSAWX QHGMV
GWVSX CSXXC VBSNV ZVNVN SAWQZ ORVXJ CVOQE JCGUW NVA
Over the past 25 years, many papers have been published on the subject in various journals, and many interesting approaches have been proposed. It may be a surprise for some to learn that even state-of-the-art algorithms are not capable of reliably solving very short cryptograms (under 200 characters in length), especially when word divisions are eliminated.
My own attempt, building on work done by Prof. Richard Spillman (who first proposed attacking the problem via genetic algorithms) and Sam Hasinoff (who first suggested using character n-gram models), is an open source program called Alkindus. It works quite well, and is easily capable of solving puzzles similar to the ones from Mathematical Cryptography in a fully automated manner.
The work-in-progress paper describing the theory and empirical results is in need of a large collection of cryptograms completely independent of the Project Gutenberg corpus of English-language text that I used to train the underlying n-gram model. To use cryptograms derived from the training data itself would be cheating! So, I am especially interested in obtaining large amounts of English text (varying from say, 50 to 2500 characters in length), which could be used to put the program through a proper, rigorous test. If you have any suggestions, please let me know.