I’ve been trying to figure out an appropriate information model (most likely XML-based) to correspond to my annotation schema. As I have started to form my notions of how this should look to allow for future expansion, ease of use when annotating, and accessibility for feature extraction, I’m kind of rethinking how annotation of the thematic roles should look.
In the annotations we’ve done so far, we’ve been labeling sections of the text with labels like agent, patient, theme, etc. However, in past discussions we’ve both come to the conclusion that this should be produced by a statistical semantic role labeler with arguments mapped to something like VerbNet classes.
To think through the possible annotation, I’ve started playing around a bit with ASSERT, and have come to the realization that its spans are nowhere near as detailed as how I have been annotating, and I may need to back things off a bit.
Take this example:
S0: <new_speaker_male_1> we’ve been learning a little bit about <um> how electricity conducts through a battery to a light bulb
T9: You said you’ve been learning about electricity and lightbulbs
T10: Tell me more about that.
If I run the first statement through ASSERT I get the following parses:
>new_speaker_male_1< [ARG0 we] ‘ve been [TARGET learning ] [ARG1 a little bit about >um<] how electricity conducts through a battery to a light bulb
>new_speaker_male_1< we ‘ve been learning a little bit about >um< [ARGM-MNR how] [ARG1 electricity] [TARGET conducts ] through a battery to a light bulb
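To compare these parses against the hand annotations, it helps to turn the bracketed output into token offsets. The following is a minimal sketch, assuming the output always has the flat `[LABEL words]` shape shown above with no nested brackets; the function name `parse_assert_line` and the regular expression are my own illustration and not part of ASSERT.

```python
import re

# Matches one bracketed chunk such as "[ARG0 we]" or "[TARGET learning ]".
# Assumption: brackets are flat (never nested), as in the two parses above.
BRACKET = re.compile(r"\[(\S+)\s+([^\]]*?)\s*\]")


def parse_assert_line(line):
    """Return (tokens, spans) for one bracketed parse line.

    tokens: whitespace-separated tokens with the bracket markup removed.
    spans:  list of (label, start, end) token offsets, end exclusive.
    """
    tokens, spans = [], []
    pos = 0
    for match in BRACKET.finditer(line):
        # Unlabeled text between the previous bracket and this one.
        tokens.extend(line[pos:match.start()].split())
        start = len(tokens)
        tokens.extend(match.group(2).split())
        spans.append((match.group(1), start, len(tokens)))
        pos = match.end()
    # Unlabeled text after the last bracket.
    tokens.extend(line[pos:].split())
    return tokens, spans
```

Because both parses cover the same sentence, the token offsets are comparable across predicates, which makes it easy to see which words no argument covers.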
Notice first of all how the entities marked by the tutor do not correspond exactly to any of the arguments for any of the predicates. How would a link annotation look in this instance for the marking act at turn T9? Would it be best to advise the annotators to select the argument (role) closest to the one present, and if none fit, make no link? Or would something else be more appropriate?
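One way to make the "closest argument, otherwise no link" option concrete is to pick the argument that shares the most tokens with the entity the tutor marked. This is only a sketch of one possible policy, not a decision about the annotation guidelines; the function `best_argument`, the `min_overlap` threshold, and the token offsets (counted by whitespace over S0) are assumptions made for illustration.

```python
def best_argument(entity_span, arg_spans, min_overlap=1):
    """Pick the argument whose token span overlaps the entity the most.

    entity_span: (start, end) token offsets, end exclusive.
    arg_spans:   list of (label, start, end) token offsets, end exclusive.
    Returns the winning label, or None when no argument shares at least
    min_overlap tokens with the entity (the "make no link" case).
    Ties go to the argument listed first.
    """
    e_start, e_end = entity_span
    best_label, best_overlap = None, 0
    for label, start, end in arg_spans:
        overlap = min(e_end, end) - max(e_start, start)
        if overlap > best_overlap:
            best_label, best_overlap = label, overlap
    return best_label if best_overlap >= min_overlap else None


# Spans from the "conducts" parse of S0, as whitespace-token offsets.
conducts_args = [("ARGM-MNR", 10, 11), ("ARG1", 11, 12), ("TARGET", 12, 13)]

electricity = (11, 12)  # "electricity"
light_bulb = (18, 20)   # "light bulb"

print(best_argument(electricity, conducts_args))
print(best_argument(light_bulb, conducts_args))
```

Under this policy "electricity" would link to ARG1 of "conducts", while "light bulb" falls outside every argument span and would get no link, which is exactly the gap that worries me.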
Secondly, do you think it would be better to annotate the links directly to the PropBank argument structure, or would it be better to see if we could translate the arguments into a VerbNet role first and then do the linking?
I know these are kind of mundane details, but I think I need to work through them to really settle on the annotation approach.
When one starts thinking of famous names in the speech and natural language processing world, names like Jurafsky, Joshi, or Jelinek come to mind. These names are pretty much universally recognized, and chances are, if you work in NLP you have either met them yourself, or work with someone who has. However, there is another well-known name in the community, but chances are slim that anybody who works in NLP has actually met this person. At the same time, almost anyone who has done work in parsing or semantic role labeling might be able to tell you an age they associate with his name.
Who is this person, and why would anybody know this? The answer comes from an artifact of NLP history. Some time in the early 90’s Mitch Marcus and others at the University of Pennsylvania obtained a million words of 1989 Wall Street Journal material. The first two sentences of this corpus are:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
By 1992, this text had been hand-labeled with part-of-speech tags and syntactic parse structure. By 2005 Martha Palmer and Mitch Marcus had led an effort to add semantic information in the form of predicate-argument structure on top of the existing treebank. More recently, the Conference on Computational Natural Language Learning (CoNLL) converted this treebank to dependency parse structures. What does this all mean? In short, it means that nearly every English-based statistical parser and semantic role labeler has been trained on this data, and with that training comes debugging, which inevitably leads many to read about the illustrious Pierre Vinken.
But who is Pierre Vinken? Searches on the web yield little additional information. Many of the search results are confused by his legacy in the Penn Treebank. Others are in Dutch. Some point to books he has written. The most insight comes from a brief paragraph in an article titled “The History and Heritage of Science Information Systems”, which strangely enough is hosted on a University of Pennsylvania library site.
Another non-traditional information pioneer I should mention is Pierre Vinken. A neurosurgeon and editor, I met him in the 1950s when the Excerpta Medica Foundation was established. He converted this to a commercial enterprise which has become one of the world’s largest publishing conglomerates: Reed Elsevier.
Given the sparseness of information about him, and the fact that over 20 years have elapsed since the aforementioned sentences were published, I sometimes wonder if Mr. Vinken is still alive, and if he is, whether he knows of his role in the world of computational linguistics.