I’m always excited to see NLP research featured in the mass media because it closes the gap between what my mom thinks I do and what I actually do. The most recent buzz has been for work from Kevin Knight, his students at the Information Sciences Institute (ISI) at the University of Southern California (USC), and their collaborators in the Department of Linguistics at Uppsala University in Sweden. They used techniques from NLP and machine translation to decipher a centuries-old secret code known as the Copiale Cipher. The approach consisted of scanning the 105-page manuscript, creating machine-readable transliterations of the symbols, and using a bag of statistical tricks to crack the code. By using clustering they were able to find common subsequences, which they could then use to find the most likely source language. After many failed attempts, they made a breakthrough with German and ultimately found that the manuscript described a series of rituals with a strange emphasis on eyeballs, plucked eyebrows, and ophthalmology. Sadly, this paragraph does not do justice to all the labor involved or the technical problems they had to solve to get a result.
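To give a rough flavor of what "finding common subsequences" can look like in practice, here is a minimal sketch that counts repeated n-grams in a transliterated symbol stream. This is not the authors' method or code. The function name `repeated_ngrams` and the symbol names in the sample are made up for illustration; the real transliteration scheme is described in their paper.

```python
from collections import Counter


def repeated_ngrams(tokens, n, min_count=2):
    """Return (ngram, count) pairs for n-grams seen at least min_count times.

    tokens: list of transliterated cipher symbols (hypothetical names here).
    n: length of the subsequence to count.
    """
    # Slide a window of width n over the token stream and count each window.
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    counts = Counter(windows)
    # most_common() sorts by count, highest first.
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


# Hypothetical transliteration: each token stands for one cipher symbol.
sample = "tri arr bar tri arr lam tri arr bar".split()

for gram, count in repeated_ngrams(sample, 2):
    print(" ".join(gram), count)
```

Sequences that repeat often are candidates for common words or letter combinations in the underlying language, which is the kind of evidence that can help narrow down a source language.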
For more details, you can read their preliminary results in the paper they presented at this past June’s Workshop on Building and Using Comparable Corpora, hosted at the annual conference of the Association for Computational Linguistics (ACL). To learn more about the corpus, visit the project page.
For a more accessible, higher-level overview, I suggest reading any of these articles:
- MSNBC – Secret society’s code cracked (includes a video)
- Ars Technica – Translation algorithms used to crack centuries-old secret code
- Discover Magazine – Finally Mysterious Cipher Code Cracked
- PhysOrg.com – Computer scientist cracks mysterious ‘Copiale Cipher’