Helping the spammers

As I was updating the site, I noticed all the ways that people (myself included) try to obfuscate their e-mail addresses to keep them from being picked up by the spammers diligently crawling the web. You often see variations like

  • first.last@domain.com
  • first@lastname.com
  • first[dot]last@domain.com
  • <first_initial><last_name>@domain.com

Even though the extra layer of HTML makes the already obfuscated addresses above into gobbledygook like &lt;first_initial&gt;&lt;last_name&gt;@domain.com, a motivated programmer with enough time and patience could create a list of regular expressions to handle just about anything.
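As a rough illustration of what such a list of regular expressions might look like, here is a minimal sketch. The `REWRITES` table and the `normalize` function are names I made up for this example, the "[at]" pattern is an assumed variation that does not appear in the list above, and a real page would need many more patterns than these two.

```python
import html
import re

# Hypothetical rewrite rules: each pattern is replaced by the character it hides.
# Only "[dot]" comes from the examples above; "[at]" is an assumed variant.
REWRITES = [
    (re.compile(r"\s*\[\s*dot\s*\]\s*", re.IGNORECASE), "."),
    (re.compile(r"\s*\[\s*at\s*\]\s*", re.IGNORECASE), "@"),
]

def normalize(text):
    """Undo HTML escaping, then apply each rewrite rule in turn."""
    text = html.unescape(text)  # "&lt;" -> "<", "&gt;" -> ">"
    for pattern, replacement in REWRITES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("first[dot]last@domain.com"))
# first.last@domain.com
print(normalize("&lt;first_initial&gt;&lt;last_name&gt;@domain.com"))
# <first_initial><last_name>@domain.com
```

Note that the second result is still only a template, not an address, which is part of why chasing every pattern by hand gets tedious.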

However, spamming is not a precise art. There is no need to get every last e-mail address, and odds are that the people who list their addresses like the ones above are not going to fall for an e-mail scam anyway. If I were a spammer, I would want to exploit existing lists of names to identify known name combinations. I imagine most people would say the spammers are already doing that. Going through all the combinations of first and last names is probably quite fruitful, but it also produces far more nonexistent combinations than real ones.

Alternatively, one could use these same lists in conjunction with a list of domain addresses and a simple greedy algorithm known as maxmatch to find known good combinations of first and last names. The premise is simple: the algorithm moves from left to right through the address, looking for the longest match among the names in the list.

The code looks something like this:

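def max_match(word, word_list):
    """Greedily split word into the longest entries found in word_list."""
    start = 0
    words = []

    while start < len(word):
        match = False
        # Try the longest candidate first, shrinking it from the right.
        for end in range(len(word), start, -1):
            if word[start:end] in word_list:
                words.append(word[start:end])
                match = True
                start = end
                break  # keep only the longest match at this position
        if not match:
            # Nothing in the list starts here, so skip one character.
            start += 1
    return words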

When used for word segmentation or hashtag segmentation, maxmatch usually falls victim to its own greediness, because English has a large number of small words that are easily swallowed by larger ones. Borrowing an example from my professor Jim Martin, “thetabledownthere” would get segmented into “theta bled own there”. With names this should be less of an issue, since most of them are at least four characters long and the combinations tend to be less common than normal English words. For example, with my name there is little chance it would get segmented incorrectly, unless leeb were a name.
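To make the greediness concrete, here is a small self-contained sketch. The tiny word lists are made up for illustration (they are not real dictionaries or name lists), and "johnsmith" is a hypothetical address prefix. The function repeats the same left-to-right, longest-match-first idea described above.

```python
def max_match(word, word_list):
    """Greedy left-to-right segmentation using the longest match at each step."""
    start = 0
    words = []
    while start < len(word):
        match = False
        for end in range(len(word), start, -1):
            if word[start:end] in word_list:
                words.append(word[start:end])
                match = True
                start = end
                break
        if not match:
            start += 1  # nothing in the list starts here; skip one character
    return words

# Toy English vocabulary: "theta" swallows "the", and everything after it shifts.
english = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", english))
# ['theta', 'bled', 'own', 'there']

# Hypothetical name lists: one longer entry is enough to derail the split.
print(max_match("johnsmith", {"john", "smith"}))
# ['john', 'smith']
print(max_match("johnsmith", {"john", "johns", "smith"}))
# ['johns']
```

In the last call, "johns" is consumed first and nothing in the list matches the leftover "mith", so the surname is lost entirely. That is the same failure as the leeb case, just with made-up names.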

Of course, this is still probably more trouble than whatever the spammers are already doing.