Category Archives: NLP

Question Generation: mind-the-gap corpus 1.0

During my internship at Microsoft Research last summer, Sumit Basu, Lucy Vanderwende and I developed a system for generating quizzes from arbitrary text data.
Now that we have presented the paper at NAACL-HLT 2012, Microsoft is making the corpus we compiled for these experiments publicly available for download.
For more details about the corpus, visit the MindTheGap-1.0 corpus home page.

NLP in the Media: The Copiale Cipher

I’m always excited to see NLP research featured in the mass media because it closes the gap between what my mom thinks I do and what I actually do. The most recent buzz has been for work from Kevin Knight, his students at the Information Sciences Institute (ISI) at the University of Southern California (USC), and their collaborators in the Department of Linguistics at Uppsala University in Sweden. They used techniques from NLP and machine translation to decipher a centuries-old secret code known as the Copiale Cipher. The approach consisted of scanning the 105-page manuscript, creating machine-readable transliterations of the symbols, and using a bag of statistical tricks to crack the code. By using clustering they were able to find common subsequences, which they could then use to find the most likely source language. After lots of failed attempts, they made breakthroughs with German and ultimately found that the manuscript described lots of rituals with a strange emphasis on eyeballs, plucked eyebrows, and ophthalmology. Sadly, this paragraph does not do justice to all the labor and technical problems they had to work through to get a result.
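To give a rough feel for the "common subsequences" step, here is a minimal Python sketch that counts repeated n-grams in a transliterated symbol sequence. This is my own illustration and not the authors' code: the function name `frequent_ngrams`, the toy symbol names, and the plain n-gram counting are all assumptions standing in for whatever clustering and statistics the paper actually used.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count=2):
    """Return (ngram, count) pairs for n-grams seen at least min_count times."""
    # Slide a window of length n over the token sequence and count each window.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


# Hypothetical transliteration: each cipher symbol is written as a short name.
# These names are made up for the example, not taken from the real manuscript.
transliteration = "lam uh z bar lam uh z gam bar lam uh".split()

for gram, count in frequent_ngrams(transliteration, 3):
    print(" ".join(gram), count)
```

Sequences that repeat often are candidates for common words or word endings, and those patterns can then be compared against what you would expect from different candidate languages.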

For more details you can read their preliminary results in their paper at this past June’s Workshop on Building and Using Comparable Corpora, hosted at the annual conference of the Association for Computational Linguistics (ACL). To learn more about the corpus, visit the project page.

For a more accessible, higher-level overview, I suggest reading any of these articles: