Tag Archives: NLP

Question Generation: mind-the-gap corpus 1.0

During my internship at Microsoft Research last summer, Sumit Basu, Lucy Vanderwende and I developed a system for generating quizzes from arbitrary text data.
Now that we have presented the paper at NAACL-HLT 2012, Microsoft is making the corpus we compiled for these experiments publicly available for download.
For more details about the corpus, visit the MindTheGap-1.0 corpus home page.

NLP in the Media: The Copiale Cipher

I’m always excited to see NLP research featured in the mass media because it closes the gap between what my mom thinks I do and what I actually do. The most recent buzz has been for work from Kevin Knight, his students at the Information Sciences Institute (ISI) at the University of Southern California (USC), and their collaborators in the Department of Linguistics in Uppsala, Sweden. They used techniques from NLP and Machine Translation to decipher a centuries-old secret code known as the Copiale Cipher. The approach consisted of scanning in the 105-page manuscript, creating machine-readable transliterations of the symbols, and using a bag of statistical tricks to crack the code. By using clustering they were able to find common subsequences, which they could then use to find the most likely source language. After lots of failed attempts, they made breakthroughs with German and ultimately found that the manuscript described lots of rituals with a strange emphasis on eyeballs, plucked eyebrows, and ophthalmology. Sadly, this paragraph does not do justice to all the labor and technicalities they had to work through to get a result.
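To give a rough flavor of what "finding common subsequences" can mean, here is a minimal sketch that counts repeated runs of symbols in a transliteration. This is not the authors' actual method or code; the function name `common_ngrams` and the toy symbol names are made up for illustration.

```python
from collections import Counter

def common_ngrams(symbols, n, top=3):
    """Return the `top` most frequent length-n runs of symbols."""
    # zip over n staggered copies of the sequence to form n-grams
    grams = zip(*(symbols[i:] for i in range(n)))
    return Counter(grams).most_common(top)

# Hypothetical transliteration: each cipher symbol is written as a short name.
transliteration = "lam del tri lam del pi lam del tri".split()

for gram, count in common_ngrams(transliteration, 2):
    print(" ".join(gram), count)
```

Runs that repeat far more often than chance are candidates for common words or letter combinations in the hidden language, which is the kind of clue that helps narrow down the source language.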

For more details, you can read their preliminary results in their paper at this past June’s workshop on Building and Using Comparable Corpora, hosted at the annual conference of the Association for Computational Linguistics (ACL). To learn more about the corpus, visit the project page.

For a more accessible, higher-level overview, I suggest reading any of these articles:

Merantau: a Silat player’s review

After several months of waiting, I finally saw Merantau[1], an Indonesian-language martial arts flick. The trailers billed it as a kind of Ong Bak, but with Muay Thai swapped for Pencak Silat. Merantau more than delivered on this premise.

In terms of story, Merantau does very little to differentiate itself. A boy from the country goes to the city and encounters evil men doing evil things to the vulnerable, and through gifts of well-placed punches and kicks, he remedies the situation. Almost stereotypically, the main villain is a white guy with a bad temper who mistreats the women he plans to sell into prostitution. Additionally, blood and gore were overused to little effect.[2]

Nonetheless, while the plot is overly familiar and the acting is not altogether memorable, director Gareth Evans breaks ground in a much more interesting manner. This film marks the first time I’ve seen real Pencak Silat in a movie. Most Indonesian movies seem to forget the beauty and richness of their native arts, and typically present fight scenes using choreography that looks more like karate or tae kwon do. While those are respectable arts in their own right, they are not Silat, and it shows. Though Merantau has a good number of Hollywood (Bali-wood?) touches with gigantic leaping kicks and improbable knockouts, it manages to stay grounded in its Silat core. From the opening jurus scene to the finale, Silat practitioners will recognize many of the locks, traps, and manipulations that make the art so deadly and effective. I took great pleasure in seeing foot traps and puter kepalas[3] subtly woven into the choreography. Moreover, Iko Uwais does an incredible job of making us believe Yuda, the protagonist, is truly a pendekar[4] of Silat Minangkabau (a style from Western Sumatra whose movements are well known throughout Indonesia). Most importantly, Evans’ cinematography allows the viewer to fully appreciate the timing and fluidity of this choreography without getting overwhelmed by strange camera angles and slow-motion effects. For the most part it’s pure, enjoyable, unadulterated hand-to-hand combat.

I am hoping this movie will start an upward trend of Pencak Silat in movies. If all goes well, someone will one day make an epic set in Dutch colonial times featuring not just Silat Minang, but also Silat Madura, Silat Cimande, Silat Mataram, and even Chinese Kuntao.

[1] The word “merantau” is roughly translated as “to wander about” or “to go abroad.” It plays a central role in Minangkabau culture, as inheritance is passed down matrilineally (i.e. from woman to woman) and a man must go out into the world and earn his keep before returning to his homeland in Western Sumatra. I believe this practice explains the multitude of Padang-style eateries across the archipelago.

[2] This review reminds me of why sentiment analysis is such a hard proposition. I managed to state both negative and positive aspects in the same review in a manner that makes it nearly impossible for any algorithm to tease apart in a principled way. This little meta-blurb at the bottom probably doesn’t help either.

[3] Puter = turn, Kepala = head

[4] Pendekar = master of martial arts

Named Entity Recognition

Hi Wayne,
I’ve been trying to figure out an appropriate information model (most likely XML-based) to correspond to my annotation schema. As I have started to form my notions of how this should look to allow for future expansion, ease of use when annotating, and accessibility for feature extraction, I’m rethinking how annotation of the thematic roles should look.
In the annotations we’ve done so far, we’ve been labeling sections of the text with labels like agent, patient, theme, etc. However, in past discussions we’ve both come to the conclusion that this should be produced by a statistical semantic role labeler with arguments mapped to something like VerbNet classes.
To think through the possible annotation, I’ve started playing around a bit with ASSERT, and have come to the realization that its spans are nowhere near as detailed as my annotations, so I may need to back off a bit.
Take this example:
S0: <new_speaker_male_1> we’ve been learning a little bit about <um> how electricity conducts through a battery to a light bulb
T9: You said you’ve been learning about electricity and lightbulbs
T10: Tell me more about that.
If I run the first statement through ASSERT I get the following parses:
>new_speaker_male_1< [ARG0 we] ‘ve been [TARGET learning ] [ARG1 a little bit about >um<] how electricity conducts through a battery to a light bulb
>new_speaker_male_1< we ‘ve been learning a little bit about >um< [ARGM-MNR how] [ARG1 electricity] [TARGET conducts ] through a battery to a light bulb
Notice first of all how the entities marked by the tutor do not correspond exactly to any of the arguments for any of the predicates. How would a link annotation look in this instance for the marking act at turn T9? Would it be best to advise the annotators to select the argument (role) closest to the one present, and if none fit, make no link? Or would something else be more appropriate?
Secondly, do you think it would be better to annotate the links directly to the PropBank argument structure, or would it be better to see if we could translate the arguments into a VerbNet role first and then do the linking?
I know these are kind of mundane details, but I think I need to work through them to really finalize the annotation approach.
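To make the "closest argument, or no link" idea from the note above concrete, here is a minimal sketch. It assumes ASSERT-style output in the bracketed `[LABEL text]` form quoted above, and it uses plain word overlap as a stand-in for "closest"; the function names `labeled_spans` and `closest_argument` are made up for illustration and are not part of ASSERT.

```python
import re

# Matches ASSERT-style bracketed spans such as "[ARG0 we]".
SPAN = re.compile(r"\[(\S+)\s+([^\]]*?)\s*\]")

def labeled_spans(parse):
    """Return (label, text) pairs for every bracketed span in a parse line."""
    return SPAN.findall(parse)

def closest_argument(mention, spans):
    """Pick the span sharing the most words with `mention`, or None."""
    words = set(mention.lower().split())
    best, best_overlap = None, 0
    for label, text in spans:
        overlap = len(words & set(text.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = (label, text), overlap
    return best

parse = ("[ARGM-MNR how] [ARG1 electricity] [TARGET conducts ] "
         "through a battery to a light bulb")
spans = labeled_spans(parse)
print(spans)
print(closest_argument("electricity", spans))  # links to ARG1
print(closest_argument("lightbulbs", spans))   # no span fits, so no link
```

The tutor's "lightbulbs" finds no match because "a light bulb" falls outside every bracketed argument, which is exactly the mismatch the questions above are about.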

When one starts thinking of famous names in the speech and natural language processing world, names like Jurafsky, Joshi, or Jelinek come to mind. These names are pretty much universally recognized, and chances are, if you work in NLP you have either met them yourself or work with someone who has. However, there is another well-known name in the community, and chances are slim that anybody who works in NLP has actually met this person. At the same time, almost anyone who has done work in parsing or semantic role labeling might be able to tell you an age they associate with his name.

Who is this person, and why would anybody know this? The answer comes from an artifact of NLP history. Sometime in the early ’90s, Mitch Marcus and others at the University of Pennsylvania obtained a million words of 1989 Wall Street Journal material. The first two sentences of this corpus are:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .

By 1992, this text had been hand-labeled with part-of-speech tags and syntactic parse structure. By 2005, Martha Palmer and Mitch Marcus had led an effort to add semantic information in the form of predicate-argument structure on top of the existing treebank. More recently, the Conference on Computational Natural Language Learning (CoNLL) converted this treebank to dependency parse structures. What does this all mean? In short, it means that nearly every English-based statistical parser and semantic role labeler has been trained on this data, and with that training comes debugging, which inevitably leads many to read about the illustrious Pierre Vinken.

But who is Pierre Vinken? Searches on the web yield little additional information. Many of the search results are confused by his legacy in the Penn Treebank. Others are in Dutch. Some point to a book he has written. The most insight comes from a brief paragraph in an article titled “The History and Heritage of Science Information Systems”, which strangely enough is hosted on a University of Pennsylvania library site.

Another non-traditional information pioneer I should mention is Pierre Vinken. A neurosurgeon and editor, I met him in the 1950s when the Excerpta Medica Foundation was established. He converted this to a commercial enterprise which has become one of the world’s largest publishing conglomerates — Reed Elsevier.

Given the sparseness of information about him, and the fact that over 20 years have elapsed since the aforementioned sentences were published, I sometimes wonder if Mr. Vinken is still alive and, if he is, whether he knows of his role in the world of computational linguistics.