
Helping the spammers

As I was updating the site, I took notice of all the ways that people (myself included) attempt to obfuscate their e-mail addresses to avoid having them picked up by the spammers diligently crawling the web. You often see variations like:

  • first.last@domain.com
  • first@lastname.com
  • first[dot]last@domain.com
  • <first_initial><last_name>@domain.com

Even though the extra layer of HTML escaping turns the already obfuscated addresses above into gobbledygook like &lt;first_initial&gt;&lt;last_name&gt;@domain.com, a motivated programmer with enough time and patience could create a list of regular expressions to handle just about anything.
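
To get a feel for how little effort that takes, here is a toy de-obfuscation pass in Python. It is only a sketch: the handful of patterns and the deobfuscate function are my own invention, and a real harvester would carry a much longer pattern list.

import html
import re

def deobfuscate(text):
    # Undo the HTML entity layer first: &lt;first&gt; becomes <first>.
    text = html.unescape(text)
    # Normalize common "[dot]" and "[at]" spellings back into symbols.
    text = re.sub(r"\[\s*dot\s*\]", ".", text, flags=re.IGNORECASE)
    text = re.sub(r"\[\s*at\s*\]", "@", text, flags=re.IGNORECASE)
    return text

print(deobfuscate("first[dot]last[at]domain.com"))  # first.last@domain.com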

However, spamming is not a precise art. There is no need to get every last e-mail address, and odds are the people who list their e-mail addresses like the ones above are not going to fall for any e-mail scam. If I were a spammer, I would want to exploit existing name lists to identify known name combinations, and I would imagine most spammers would say they are already doing that. Going through all the combinations of first and last names is probably quite fruitful, but it also produces far more non-existent combinations than real ones.
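
That brute-force enumeration is trivial to code, which is why the real cost is the false positives rather than the programming effort. A minimal sketch, with placeholder name lists and a placeholder domain:

from itertools import product

# Placeholder lists; a spammer would pull these from public name-frequency data.
first_names = ["lee", "david", "joe"]
last_names = ["becker", "johnson", "smith"]

guesses = [f"{first}.{last}@domain.com"
           for first, last in product(first_names, last_names)]
print(len(guesses), "candidate addresses, most of which will not exist")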

Alternatively, one could use these same lists in conjunction with a list of domains and a simple greedy algorithm known as maxmatch to recover known-good combinations of first and last names. The premise is simple: the algorithm scans the address from left to right, at each position taking the longest match it can find among the names in the list.

The code looks something like this:

def max_match(word, word_list):
    """Greedily segment word into the longest pieces found in word_list."""
    start = 0
    words = []

    while start < len(word):
        match = False
        # Try the longest possible substring first and shrink from the right.
        for end in range(len(word), start, -1):
            if word[start:end] in word_list:
                words.append(word[start:end])
                start = end
                match = True
                break
        if not match:
            # Nothing in the list matched; emit the single character and move on.
            words.append(word[start])
            start += 1
    return words
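
For instance, with a hypothetical set of known first and last names, segmenting the local part of an address looks like this:

# Hypothetical name list; in practice this would be a large set of
# known first and last names, lowercased for lookup.
names = {"lee", "becker", "david", "johnson"}

print(max_match("leebecker", names))     # ['lee', 'becker']
print(max_match("davidjohnson", names))  # ['david', 'johnson']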

When used for word segmentation or hashtag segmentation, maxmatch usually falls victim to its own greediness, because English has a large number of short words that are easily swallowed by longer ones. Borrowing an example from my professor Jim Martin, “thetabledownthere” gets segmented into “theta bled own there”. With names this should be less of an issue, since most names are at least four characters long and their combinations are less common than ordinary English words. My own name, for example, has little chance of being segmented incorrectly, unless “leeb” were a name.
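
The failure is easy to reproduce with the function above. The toy word list below is just big enough to contain both the correct segmentation and the words that derail it:

english = {"the", "theta", "table", "bled", "down", "own", "there"}

# Greedy matching grabs "theta" before it can consider "the" + "table".
print(max_match("thetabledownthere", english))
# ['theta', 'bled', 'own', 'there']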

Of course, this is still probably more trouble than whatever the spammers are already doing.

Me Sense Disambiguation

As you can probably guess from my URL and the list of links in the sidebar, I am somewhat obsessed with disambiguating myself from the masses on the interwebs who share my somewhat common name. While I’m no Joe Smith or David Johnson, the task still presents challenges, especially when you consider that there was a Lee Becker in the department of Computer Science at Worcester Polytech who specialized in artificial intelligence and Slavic linguistics. Further complicating matters, there is another one of us just down the road at the University of Colorado, Colorado Springs, in the department of Psychology. If you trusted the selection bias of my Google first page, you’d be convinced that the Lee Beckers of the world must all be interested in some combination of computer science, linguistics, and cognitive science. But a quick search on LinkedIn will show that we do plenty of other things. Furthermore, based on a series of wrongly addressed e-mails, I know there are one or more Lee Beckers in New York who are somehow affiliated with sports and building management.

While this task is called named-entity disambiguation in natural language processing, I like to draw a parallel to verb sense disambiguation for my own situation. With many verbs, the dominant sense accounts for most of the occurrences. For example, sense 1 of work refers to the act of getting something done, but occasionally it gives way to sense 2, to knead or fold, as with bread. My megalomania wants the Silat-loving computational linguist from Colorado to be sense 1, but I suspect I will need to be more prolific and more diligent before that happens.

Do other people obsess about this?  Perhaps this is all a byproduct of not having a middle name.

Named Entity Recognition

Hi Wayne,
I’ve been trying to figure out an appropriate information model (most likely XML-based) to correspond to my annotation schema. As I’ve started to form my notions of how this should look (allowing for future expansion, ease of use when annotating, and accessibility for feature extraction), I’m rethinking how annotation of the thematic roles should look.
In the annotations we’ve done so far, we’ve been labeling sections of the text with labels like agent, patient, theme, etc.  However, in past discussions we’ve both come to the conclusion that this should be produced by a statistical semantic role labeler with arguments mapped to something like VerbNet classes.
To think through the possible annotation, I’ve started playing around a bit with ASSERT, and have come to the realization that its spans are nowhere near as detailed as how I have been annotating, so I may need to scale things back a bit.
Take this example:
S0: <new_speaker_male_1> we’ve been learning a little bit about <um> how electricity conducts through a battery to a light bulb
T9: You said you’ve been learning about electricity and lightbulbs
T10: Tell me more about that.
If I run the first statement through ASSERT I get the following parses:
>new_speaker_male_1< [ARG0 we] ‘ve been [TARGET learning ] [ARG1 a little bit about >um<] how electricity conducts through a battery to a light bulb
>new_speaker_male_1< we ‘ve been learning a little bit about >um< [ARGM-MNR how] [ARG1 electricity] [TARGET conducts ] through a battery to a light bulb
Notice first of all how the entities marked by the tutor do not correspond exactly to any of the arguments for any of the predicates. How would a link annotation look in this instance for the marking act at turn T9? Would it be best to advise the annotators to select the argument (role) closest to the one present, and if none fit, make no link? Or would something else be more appropriate?
Secondly, do you think it would be better to annotate the links directly to the PropBank argument structure, or would it be better to see if we could translate the arguments into a VerbNet role first and then do the linking?
I know these are kind of mundane details, but I think I need to work through them to really finalize the annotation approach.
Thanks,
Lee

When one starts thinking of famous names in the speech and natural language processing world, names like Jurafsky, Joshi, or Jelinek come to mind. These names are pretty much universally recognized, and chances are, if you work in NLP, you have either met them yourself or work with someone who has. However, there is another well-known name in the community, though the chances are slim that anybody who works in NLP has actually met this person. At the same time, almost anyone who has done work in parsing or semantic role labeling could probably tell you an age they associate with his name.

Who is this person, and why would anybody know this? The answer comes from an artifact of NLP history. Sometime in the early ’90s, Mitch Marcus and others at the University of Pennsylvania obtained a million words of 1989 Wall Street Journal material. The first two sentences of this corpus are:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .

By 1992, this text had been hand-labeled with part-of-speech tags and syntactic parse structure. By 2005, Martha Palmer and Mitch Marcus had led an effort to add semantic information, in the form of predicate-argument structure, on top of the existing treebank. More recently, the Conference on Computational Natural Language Learning (CoNLL) converted this treebank to dependency parse structures. What does this all mean? In short, it means that nearly every statistical parser and semantic role labeler for English has been trained on this data, and with that training comes debugging, which inevitably leads many to read about the illustrious Pierre Vinken.
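
If you want to meet Mr. Vinken yourself, a roughly ten-percent sample of this corpus ships with NLTK (assuming you have NLTK installed and are willing to download the sample):

import nltk

nltk.download("treebank")  # one-time fetch of the bundled WSJ sample
from nltk.corpus import treebank

# The very first sentence in the sample is the Vinken sentence.
print(" ".join(treebank.sents()[0]))
print(treebank.parsed_sents()[0])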

But who is Pierre Vinken? Searches on the web yield little additional information. Many of the search results are muddled by his legacy in the Penn Treebank. Others are in Dutch. Some point to a book he has written. The most insight comes from a brief paragraph in an article titled “The History and Heritage of Science Information Systems”, which, strangely enough, is hosted on a University of Pennsylvania library site.

Another non-traditional information pioneer I should mention is Pierre Vinken. A neurosurgeon and editor, I met him in the 1950s when the Excerpta Medica Foundation was established. He converted this to a commercial enterprise which has become one of the world’s largest publishing conglomerates — Reed Elsevier.

Given the sparseness of information about him, and the fact that over 20 years have elapsed since the aforementioned sentences were published, I sometimes wonder if Mr. Vinken is still alive, and, if he is, whether he knows of his role in the world of computational linguistics.