I just wanted to announce that there is now a meetup for people in the Front Range who are interested in all things related to Natural Language Processing. If you’re interested be sure to join the group, and if you have ideas for possible topics, feel free to share. Hopefully this grows into a venue where people can come to share their knowledge and discuss ideas relating to NLP.
Back when the Presidents of the United States of America song “Peaches” got lots of air time, I would sub my name in for the word “me”:
Millions of peaches, peaches for Lee
Millions of peaches, peaches for free
Millions of peaches, peaches for Lee
Millions of peaches, peaches for free
With my mom’s abundant peach harvest this year, I find that this chorus perfectly describes my current situation. While my brother has opted to eat on the order of 8 peaches a day, I’ve been looking for different vectors for peach consumption. In the past week I’ve made a peach, rum, and vanilla topping for ice cream, and a peach and chili-cayenne barbecue sauce. Last night I attended an end-of-summer barbecue, and I thought using these peaches in a bruschetta would be a nice way to whittle down my ever-ripening supply of this fuzzy fruit.
Grilled Peach and Tomato Bruschetta
- 1 package soft tofu
- 4-6 peaches
- 3-4 medium tomatoes
- 1/4 red onion
- basil – preferably fresh, but dried works too
- balsamic vinegar
- olive oil
- lemon juice
- garlic powder
- honey (can be replaced with other sweetener)
- 1 baguette
Tofu “Ricotta” Spread
To prepare, drain and dry the tofu. Place it in a bowl and mash with your hands or a fork. Add the basil, lemon juice (2-3 tbsp), garlic powder, and salt to taste. Drizzle in a small bit of olive oil. Mix until it reaches a ricotta-like consistency.
Peach / Tomato Topping
Halve the peaches and the red onion. Brush with olive oil and place on the grill for a few minutes – just long enough to get some nice grill marks. Once off the grill, remove the skins from the peaches and let cool in the refrigerator for 1-2 hours.
In a new bowl mix chopped peaches and tomatoes and minced onion. Add a light drizzling of olive oil and balsamic vinegar. Add honey, basil, salt, and pepper to taste.
Slice the baguette into half-inch slices. Brush both sides with olive oil and place on the grill to toast. If you don’t have a grill, a toaster or the broiler setting in your oven can work. Be sure to flip the slices, and toast both sides until golden brown.
Spread tofu mixture on baguette slices, top with a spoonful of peach topping and serve.
If you’re feeling especially ambitious, make a balsamic vinegar reduction and lightly drizzle over the assembled product.
During my internship at Microsoft Research last summer, Sumit Basu, Lucy Vanderwende and I developed a system for generating quizzes from arbitrary text data.
Now that we have presented the paper at NAACL-HLT 2012, Microsoft is making the corpus we compiled to enable these experiments publicly available for download.
For more details about the corpus visit the MindTheGap-1.0 corpus home page.
I’m always excited to see NLP research featured in the mass media because it closes the gap between what my mom thinks I do and what I actually do. The most recent buzz has been for work from Kevin Knight, his students at the Information Sciences Institute (ISI) at the University of Southern California (USC), and their collaborators in the Department of Linguistics at Uppsala University in Sweden. They used techniques from NLP and machine translation to decipher a centuries-old secret code known as the Copiale Cipher. The approach consisted of scanning in the 105-page manuscript, creating machine-readable transliterations of the symbols, and using a bag of statistical tricks to crack the code. Using clustering, they were able to find common subsequences, which they could then use to identify the most likely source language. After lots of failed attempts, they made breakthroughs with German and ultimately found that the manuscript described lots of rituals, with a strange emphasis on eyeballs, plucked eyebrows, and ophthalmology. Sadly, this paragraph does not do justice to all the labor and technical hurdles they had to overcome to get a result.
For more details you can read their preliminary results in their paper from this past June’s workshop on Building and Using Comparable Corpora, hosted at the annual conference of the Association for Computational Linguistics (ACL). To learn more about the corpus, visit the project page.
For a more accessible, higher-level overview, I suggest reading any of these articles:
- MSNBC – Secret society’s code cracked (includes a video)
- Arstechnica – Translation algorithms used to crack centuries-old secret code
- Discover Magazine – Finally Mysterious Cipher Code Cracked
- PhysOrg.com – Computer scientist cracks mysterious ‘Copiale Cipher’
I spent a good part of this evening (can you call it evening when it’s well past midnight?) trying to learn how to do a simple Java-based query using Amazon’s Product Advertising API. This entire exercise could have been finished within an hour had there been clear, concise documentation. Instead of one recipe for success, Amazon presents a dizzying array of approaches that lead to many dead ends.
In hopes of saving others from these difficulties, I present you with my findings. For those of you who just want the recipe, jump to the bottom.
Lesson 1 – Don’t Trust the Getting Started Guide
Admittedly, the getting started guide gives a useful overview of the functionality and the approach to using the API. However, following its instructions step by step will only lead to frustration. The Java example for Implementing a Product Advertising API Request DOES NOT WORK! One would think that simply swapping in the proper access key where it says “YOUR ID” would be all that is needed, but upon execution I found it yields the following:
Exception in thread "main" com.sun.xml.internal.ws.client.ClientTransportException: The server sent HTTP status code 400: Bad Request
Thinking I had omitted something small, I looked into resolving this error only to discover:
Lesson 2 – The Product Advertising API’s Authentication has changed
As of August 15, 2009, the API requires a signature mechanism for authentication. In addition to invalidating all the code from the getting started guide, this change adds additional, poorly documented steps to the process. Amazon does provide some detail, but it’s probably not the quickest path to getting up and running.
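For the curious, the gist of the signing scheme can be sketched in a few lines – shown here in Python rather than Java, and the host, path, parameters, and keys below are illustrative placeholders, not a drop-in replacement for Amazon’s own helper classes:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sign_request(params, secret_key, host="ecs.amazonaws.com", path="/onca/xml"):
    # Canonicalize: sort parameters by name and percent-encode per RFC 3986
    canonical = "&".join("%s=%s" % (quote(k, safe=""), quote(v, safe=""))
                         for k, v in sorted(params.items()))
    # The string to sign is the HTTP verb, host, path, and query, newline-separated
    string_to_sign = "\n".join(["GET", host, path, canonical])
    # HMAC-SHA256 the string with your secret key, then base64-encode the digest
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode()
    # Append the signature as one more query parameter
    return "http://%s%s?%s&Signature=%s" % (host, path, canonical,
                                            quote(signature, safe=""))
```

In a real request the parameters must also include your AWSAccessKeyId and a Timestamp; this sketch only illustrates the canonicalization and HMAC steps that the old unsigned examples are missing.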
Lesson 3 – There are some semi-functional examples
After digging around in the documentation, I found these two examples: Product Advertising API Signed Requests Sample Code – Java REST/QUERY and Product Advertising API Signed Requests Sample Code – Java SOAP. Since everything I had tried up to this point had been SOAP-centric, I decided to try the SOAP example first. Upon importing the code into Eclipse, I found that this example was fraught with too many errors and dependencies, so I turned to the REST example.
The REST code was clear and mostly error free. The few errors I saw were caused by the absence of the Apache Commons Codec library. After downloading this jar and adding it to my classpath, the example code finally compiled. Unfortunately, when I went to run it, I was greeted with this exception:
Server returned HTTP response code: 403 for URL: http://ecs.amazonaws.com/onca/xml?....
Lesson 4 – Apache Commons Codec 1.3 and 1.4 are different
After crawling through the forums looking for answers, I found out that the REST example above depends on Apache Commons Codec version 1.3, whereas the version I had downloaded was 1.4. It turns out the old version appended extra CRLF (\r\n) characters onto the authentication signature, and the workaround is to force the new codec to exhibit the same behavior. If you read the codec’s documentation, you’ll see that you get this behavior by setting the line length to 76 characters. To fix the REST example, change line 183 of SignatureRequestHelper to:
Base64 encoder = new Base64(76, new byte[0]);
After doing all this, I finally got a small victory in the form of real output:
Map form example:
Signed Request is "http://ecs.amazonaws.com/onca/xml?AWSAccessKeyId=...."
Signed Title is "Harry Potter and the Deathly Hallows (Book 7)"

String form example:
Request is "http://ecs.amazonaws.com/onca/xml?AWSAccessKeyId=...."
Title is "Harry Potter and the Deathly Hallows (Book 7)"
As I was updating the site, I took notice of all the ways that people (myself included) attempt to obfuscate their e-mail addresses to keep them from being picked up by the spammers diligently crawling the web. You often see variations like
Even though the extra layer of HTML makes the already obfuscated addresses above into gobbledygook like <first_initial><last_name>@domain.com, a motivated programmer with enough time and patience could create a list of regular expressions to handle just about anything.
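To make that concrete, a handful of patterns goes a long way. Here is a minimal sketch – the patterns and test strings are my own toy examples, not anything from an actual spammer’s toolkit:

```python
import re

# Common obfuscation styles: "lee [at] example [dot] com", "lee at example dot com"
AT_PATTERN = re.compile(r"\s*[\[\(]\s*at\s*[\]\)]\s*|\s+at\s+", re.IGNORECASE)
DOT_PATTERN = re.compile(r"\s*[\[\(]\s*dot\s*[\]\)]\s*|\s+dot\s+", re.IGNORECASE)

def deobfuscate(text):
    # Collapse "[at]"/"(at)"/" at " to "@" and "[dot]"/"(dot)"/" dot " to "."
    text = AT_PATTERN.sub("@", text)
    text = DOT_PATTERN.sub(".", text)
    return text
```

Two patterns already cover bracketed and spelled-out variants; a real harvester would just keep adding alternations.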
However, spamming is not a precise art. There is no need to get every last e-mail address, and odds are the people who list their addresses like the ones above are not going to fall for any e-mail scam. If I were a spammer, I would want to exploit existing lists of names to identify known name combinations, and I would imagine most spammers are already doing just that. Going through all the combinations of first and last names is probably quite fruitful, but it also produces far more non-existent combinations than real ones.
Alternatively, one could use these same name lists in conjunction with a list of domain addresses and a simple greedy algorithm known as maxmatch to find known-good combinations of first and last names. The premise is simple: the algorithm moves from left to right through the address, looking for the longest matches against the names in the list.
The code looks something like this:
def max_match(word, word_list):
    start = 0
    words = list()
    while start < len(word):
        match = False
        # look for the longest match starting at the current position
        for i in range(len(word), 0, -1):
            if word[start:i] in word_list:
                words.append(word[start:i])
                match = True
                start = i
                break
        if not match:
            # nothing matched here; emit the single character and advance
            words.append(word[start])
            start += 1
    return words
When used for word segmentation or hashtag segmentation, maxmatch usually falls victim to its own greediness, because English has a large number of small words that are easily consumed by larger ones. Borrowing an example from my professor Jim Martin, “thetabledownthere” would get segmented into “theta bled own there”. With names this should be less of an issue, since most are at least four characters long and their combinations tend to be less common than normal English words. For example, with my name there is little chance it would get segmented incorrectly, unless “leeb” were a name.
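To see both behaviors, here is a self-contained run of the algorithm above (the word and name lists are toy examples of my own):

```python
def max_match(word, word_list):
    # Greedily take the longest prefix found in word_list, left to right
    start = 0
    words = []
    while start < len(word):
        match = False
        for i in range(len(word), 0, -1):
            if word[start:i] in word_list:
                words.append(word[start:i])
                match = True
                start = i
                break
        if not match:
            # no match: emit the single character and move on
            words.append(word[start])
            start += 1
    return words

# English words: greediness picks "theta" over "the table"
english = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", english))   # ['theta', 'bled', 'own', 'there']

# Names: longer, rarer strings segment cleanly
names = {"lee", "becker", "leah", "beck"}
print(max_match("leebecker", names))             # ['lee', 'becker']
```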
Of course, this is still probably more trouble than whatever the spammers are already doing.
About a month ago I ssh-ed into my server to discover that my bash settings were not initializing. I thought it was just a little hiccup with our server configuration, but it turned out that our system had been compromised. The hacker had logged in by exploiting an ssh overload bug, and from there felt compelled to replace our ssh binaries with his own. Fortunately our files were still intact, and I was able to download a backup of all my files and databases before we wiped the system clean.
After the reinstall, I spent a couple of days reconfiguring my websites and installing the packages necessary to get my Django-based dialog annotation tool back up and running. Things were as good as they were before; we even decided to upgrade to a backup service. Then, just a few days later, another hacker broke in. This time he changed all of our passwords and pretty much made the server unusable. Thanks to some heroics, my friend Ian was able to log in through some minimal access and recover our files.
Seeing as much of the data for my dissertation lived on this server, I decided I needed to break away from the machine that had served me well for the past seven years and get a more managed service. After doing some searching, I discovered that WebFaction allows easy configuration of many web development frameworks, including Django. They also provide nice tools for importing WordPress blogs from other sites, as well as full backups of my directories and databases. Plus, it ended up being cheaper per month than my share of the old server.
And now I present the LeeBecker.com affiliate program!
Commercial aside, I have enjoyed using this service, and I now feel like I have better control of all the things I host remotely, whether they be my research tools, blogs, or code repositories. This change has also encouraged me to fully redesign my personal website in WordPress. While part of me still thinks I could do all the CSS and HTML by hand, WordPress does enough for my needs, and it brings a level of cohesion that the site previously lacked.
As you can probably guess from my URL and the list of links on the sidebar, I am somewhat obsessed with disambiguating myself from the masses on the interwebs who share my somewhat common name. While I’m no Joe Smith or David Johnson, this task still presents challenges, especially when you consider that there was a Lee Becker in the Department of Computer Science at Worcester Polytechnic Institute who specialized in artificial intelligence and Slavic linguistics. Further complicating matters, there is another one of us just down the road in the Department of Psychology at the University of Colorado, Colorado Springs. If you trusted my selection-biased first page of Google results, you’d be convinced that the Lee Beckers of the world must all be interested in some combination of computer science, linguistics, and cognitive science. But a quick search on LinkedIn shows that we do plenty of other things. Furthermore, based on a series of wrongly addressed e-mails, I know there are one or more Lee Beckers in New York who are somehow affiliated with sports and building management.
While this task is called named-entity disambiguation in natural language processing, I like to draw a parallel to verb sense disambiguation for my own situation. With many verbs, the dominant sense will account for most of the occurrences. For example, sense 1 of “work” refers to the act of getting something done, but occasionally it gives way to sense 2, to knead or fold, as with bread. My megalomania wants the Silat-loving computational linguist from Colorado to be sense 1, but I think I’m going to need to be more prolific and more diligent before that happens.
Do other people obsess about this? Perhaps this is all a byproduct of not having a middle name.
I have long enjoyed programming brain teasers, and at the same time I have been trying to find subjects that might get me blogging more consistently. So when I saw a list of 140 Google Interview Questions, I felt inspired to start a series exploring answers to some of them.
Given my interest in random numbers, I thought the following question would be a good way to kick things off:
There is a linked list of numbers of length N. N is very large and you don’t know N. You have to write a function that will return k random numbers from the list. Numbers should be completely random. It should be done in O(n).
At first glance, I thought the solution would consist of simply traversing the linked list to find N, and then selecting k random numbers between 0 and N to use as indices into the list. However, after additional consideration I realized this could be really wasteful: the linked list can’t be indexed, so the only way to retrieve the nodes of interest would be to re-traverse the list. It also seems that if the linked list represented a stream of data, one might not have the luxury of traversing the entire list at all.
Instead I discovered the answer lies in a surprisingly elegant algorithm called reservoir sampling. For the first k draws, the items in the list are saved directly as the samples; thus if the stream stops at k, we have sampled every element. The interesting part comes at the (k+1)th element and beyond. For the ith element, we need to determine how likely it is that it would appear in a uniform sample of the first i elements – that probability is k/i. We use this probability to decide whether the element is selected, and if so, we add it to our samples. But now it appears we have too many (i.e., k+1) samples, so to remedy this we randomly select an item from our existing reservoir of k samples and replace it with the ith value. If you do the math, it turns out this is identical to uniform sampling without replacement.
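Here is that math, sketched out (a standard argument, not something given in the interview question). At step j (1-indexed, j > k), a new element enters the reservoir with probability k/j, and if it does, it evicts a uniformly chosen resident; so any current resident survives step j with probability 1 − (k/j)(1/k) = (j−1)/j. For an element first seen at position i > k, the product telescopes:

```latex
P(i \text{ in final sample}) \;=\; \frac{k}{i}\,\prod_{j=i+1}^{N}\frac{j-1}{j}
\;=\; \frac{k}{i}\cdot\frac{i}{N} \;=\; \frac{k}{N}
```

The same telescoping product gives k/N for the first k elements as well – they start in the reservoir and must survive steps k+1 through N – so every element is equally likely to end up in the sample.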
Here is the python code for this algorithm:
import random

def reservoir_sampling(index, samples, k=1000):
    if index < k:
        # for the first draws up to the sample size k,
        # build a reservoir of samples directly from the data
        samples.append(index)
    else:
        # the ith draw (0-indexed) is selected with probability k/(i+1)
        p_select_index = float(k) / (index + 1)
        if random.random() <= p_select_index:
            # ith draw was selected; randomly pick an existing sample to replace
            # (randint is inclusive, so the upper bound must be k-1)
            index_to_replace = random.randint(0, k - 1)
            samples[index_to_replace] = index
Now, to give a qualitative feel, here are two histograms comparing reservoir sampling and uniform sampling for k=300 and N=1000000.
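If you’d rather check the uniformity numerically than squint at histograms, a quick simulation works too. This sketch repeats the sampler so the snippet stands alone, and scales N and k well below the numbers above so it runs in a blink; every index should be hit at roughly the same rate:

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    # One pass over the stream, keeping a uniform sample of size k
    samples = []
    for i, value in enumerate(stream):
        if i < k:
            samples.append(value)
        elif random.random() <= float(k) / (i + 1):
            samples[random.randint(0, k - 1)] = value
    return samples

random.seed(1)
n, k, trials = 100, 10, 5000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(n), k))

# Each index should appear about trials * k / n = 500 times
print(min(counts.values()), max(counts.values()))
```

The min and max counts land close to the expected 500, which is the qualitative picture the histograms show at full scale.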