Category Archives: Programming

Lessons Learned with the Amazon Product Advertising API

I spent a good part of this evening (can you call it evening when it’s well past midnight?) trying to learn how to do a simple Java-based query using Amazon’s Product Advertising API.  This entire exercise could have been finished within an hour had there been clear, concise documentation.  Instead of one recipe for success, Amazon presents a dizzying array of approaches that lead to many dead ends.

In hopes of saving others from encountering these difficulties, I present you with my findings. For those of you who just want the recipe, jump to the bottom.

Lesson 1 – Don’t Trust the Getting Started Guide

Admittedly, the getting started guide gives a useful overview of the functionality and the approach to using the API. However, following the instructions step by step will only lead to frustration. The Java example for Implementing a Product Advertising API Request DOES NOT WORK! One would think that simply swapping in the proper access key where it says “YOUR ID” would be all that is needed, but upon execution I found it yields the following:

Exception in thread "main" com.sun.xml.internal.ws.client.ClientTransportException:
The server sent HTTP status code 400: Bad Request

Thinking I had omitted something small, I looked into resolving this error only to discover:

Lesson 2 – The PA API’s authentication has changed

As of August 15, 2009, the API requires a signature mechanism for authentication. In addition to invalidating all the code in the getting started guide, this change adds more poorly documented steps to the process. Amazon does provide some detail, but it’s probably not the quickest path to getting up and running.
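
For orientation, here is a rough Python sketch of what the signature scheme boils down to as I understand it: collect the query parameters, sort and percent-encode them, sign the canonical request string with HMAC-SHA256 using your secret key, and append the Base64-encoded result as a Signature parameter. The endpoint, parameter names, and encoding details below are my own reading of the documentation, not Amazon’s sample code, so treat this as a sketch rather than a drop-in replacement.

import base64
import hashlib
import hmac
import time
import urllib.parse

def build_signed_url(params, access_key, secret_key,
                     host="ecs.amazonaws.com", path="/onca/xml"):
    # every request carries your access key and a timestamp
    params = dict(params)
    params["AWSAccessKeyId"] = access_key
    params["Timestamp"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

    # canonicalize: sort the parameters and percent-encode keys and values
    canonical = "&".join(
        urllib.parse.quote(k, safe="-_.~") + "=" +
        urllib.parse.quote(str(v), safe="-_.~")
        for k, v in sorted(params.items()))

    # sign the canonical request with HMAC-SHA256, then Base64-encode it
    string_to_sign = "\n".join(["GET", host, path, canonical])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    signature = urllib.parse.quote(base64.b64encode(digest), safe="-_.~")

    return "http://" + host + path + "?" + canonical + "&Signature=" + signature

The signed Java samples mentioned below perform essentially these same steps; the sketch is only meant to show the shape of the computation.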

Lesson 3 – There are some semi-functional examples

After digging around in the documentation, I found these two examples: Product Advertising API Signed Requests Sample Code – Java REST/QUERY and Product Advertising API Signed Requests Sample Code – Java SOAP. Since everything I had tried up until this point had been SOAP-centric, I decided to try the SOAP example first. Upon importing the code into Eclipse, I found that this example was fraught with errors and dependencies, so I turned to the REST example.

The REST code was clear and mostly error free. The few errors I saw were caused by the absence of the Apache Commons Codec library. After downloading this jar and adding it to my classpath, the example code finally compiled. Unfortunately, when I went to run it, I was greeted with this exception:

Server returned HTTP response code: 403 for URL: http://ecs.amazonaws.com/onca/xml?....

Lesson 4 – Apache Commons Codec 1.3 and 1.4 are different

After crawling through the forums looking for answers, I found out that the REST example above depended on Apache Commons Codec version 1.3, whereas the version I downloaded was 1.4. It turns out the old version appended extra CRLF (\r\n) characters onto the authentication signature, and the workaround is to force the new codec to exhibit the same behavior. If you read the codec’s documentation, you’ll see that the old default behavior corresponds to a line length of 76 characters. To fix the REST example, change line 183 of SignatureRequestHelper to:

Base64 encoder = new Base64(76, new byte[0]);

After doing all this, I finally got a small victory in the form of real output:

Map form example:
Signed Request is "http://ecs.amazonaws.com/onca/xml?AWSAccessKeyId=...."
Signed Title is "Harry Potter and the Deathly Hallows (Book 7)"

String form example:
Request is "http://ecs.amazonaws.com/onca/xml?AWSAccessKeyId=...."
Title is "Harry Potter and the Deathly Hallows (Book 7)"

TL;DR

You can use REST to query information about Amazon products by 1) using this code, 2) downloading this library, and 3) changing the Base64 encoder line as described above to restore the old CRLF behavior.

Helping the spammers

As I was updating the site, I took notice of all the ways that people (myself included) attempt to obfuscate their e-mail addresses to keep them from being picked up by the spammers diligently crawling the web. You often see variations like

  • first.last@domain.com
  • first@lastname.com
  • first[dot]last@domain.com
  • <first_initial><last_name>@domain.com

Even though the extra layer of HTML makes the already obfuscated addresses above into gobbledygook like &lt;first_initial&gt;&lt;last_name&gt;@domain.com, a motivated programmer with enough time and patience could create a list of regular expressions to handle just about anything.
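
For instance, a couple of regular expressions along these lines would undo the most common tricks. This is only a sketch: the patterns and the test address below are made up for illustration, and a real harvester would need many more rules.

import re

# Naively undo "[dot]" / "(dot)" / " dot " style obfuscation, then pull out
# anything that looks like an address. Real obfuscation has far more variants.
DOT = re.compile(r"\[dot\]|\(dot\)|\s+dot\s+", re.IGNORECASE)
AT = re.compile(r"\[at\]|\(at\)|\s+at\s+", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def deobfuscate(text):
    text = AT.sub("@", DOT.sub(".", text))
    return EMAIL.findall(text)

print(deobfuscate("first[dot]last[at]domain[dot]com"))
# ['first.last@domain.com']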

However, spamming is not a precise art. There is no need to harvest every last e-mail address, and odds are that the people who list their addresses like the ones above are not going to fall for an e-mail scam anyway. If I were a spammer, I would want to exploit existing lists of common first and last names to identify likely combinations, and I imagine most spammers would say they are already doing that. Going through all the combinations of first and last names is probably quite fruitful, but it also produces far more non-existent addresses than real ones.

Alternatively, one could use these same name lists in conjunction with a list of domains and a simple greedy algorithm known as maxmatch to extract known-good combinations of first and last names from addresses that have already been harvested. The premise is simple: the algorithm moves from left to right through the address, at each position taking the longest match it can find among the names in the list.

The code looks something like this:

def max_match(word, word_list):
    """Greedily segment `word` into the longest pieces found in `word_list`."""
    start = 0
    words = []

    while start < len(word):
        match = False
        # look for the longest match that starts at `start`
        for i in range(len(word), start, -1):
            if word[start:i] in word_list:
                words.append(word[start:i])
                match = True
                start = i
                break
        if not match:
            # nothing in the list matches here; emit the single character and move on
            words.append(word[start])
            start += 1
    return words

When used for word segmentation or hashtag segmentation, maxmatch usually falls victim to its greediness, because English has a large number of small words that are easily consumed by larger ones. Borrowing an example from my professor Jim Martin, “thetabledownthere” would get segmented into “theta bled own there”. With names this should be less of an issue, since most of them are at least four characters long and the combinations tend to be less common than ordinary English words. For example, with my name there is little chance it would get segmented incorrectly, unless “leeb” were a name.
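
To make that concrete, here is how the function above behaves on a few made-up local parts, using a deliberately tiny name list (a real harvester would load something much larger, such as census name lists):

# a hypothetical set of known first and last names
names = {"lee", "becker", "john", "smith", "anna"}

print(max_match("leebecker", names))   # ['lee', 'becker']
print(max_match("jsmith", names))      # ['j', 'smith']
print(max_match("annalee", names))     # ['anna', 'lee']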

Of course, this is still probably more trouble than whatever the spammers are already doing.

Sampling from a stream

I have long enjoyed programming brain teasers, and at the same time I have been trying to find subjects that might get me blogging more consistently.  Hence, when I saw a list of 140 Google Interview Questions, I felt inspired to start a series that explores answers to some of them.

Given my interest in random numbers, I thought the following question would be a good way to kick things off:

There is a linked list of numbers of length N. N is very large and you don’t know N. You have to write a function that will return k random numbers from the list. Numbers should be completely random.  It should be done in O(n).

At first glance, I thought the solution would consist of simply traversing the linked list to find N, and then selecting k random numbers between 0 and N.  These numbers would then be the indices into the linked list.  However, after additional consideration I realized that this could be really wasteful: the linked list can’t be indexed, so the only way to retrieve the nodes of interest would be to re-traverse the list.  It also seems that if the linked list represented a stream of data, one might not have the luxury of traversing the entire list at all.

Instead, I discovered the answer lies in a surprisingly elegant algorithm called reservoir sampling.  For the first k draws, the items in the list are saved directly as the samples, so if the stream stops at k we have kept everything seen so far.  The interesting part starts at the (k+1)th element.  For the ith element (counting from 1), we ask how likely it is that it would have been chosen had we been able to sample directly from everything seen so far; that probability is k/i.  We draw a random number to decide whether this element is selected.  If it is, we would momentarily have too many (i.e. k+1) samples, so we randomly pick one item from the existing reservoir of k samples and replace it with the ith value.  If you do the math, it turns out every element of the stream ends up in the final sample with equal probability k/N, exactly as if we had drawn a uniform sample of size k without replacement from the full list.

Here is the Python code for this algorithm:

import random

def reservoir_sampling(index, samples, k=1000):
    if index < k:
        # for the first k draws, build the reservoir directly from the data
        samples.append(index)
    else:
        # for the ith draw (zero-indexed), keep it with probability k/(i+1)
        p_select_index = float(k) / (index + 1)
        if random.random() <= p_select_index:
            # the ith draw was selected; replace a randomly chosen existing sample
            index_to_replace = random.randint(0, k - 1)
            samples[index_to_replace] = index
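
For completeness, here is the sort of driver I used to exercise the function; the stream length and sample size are arbitrary choices for illustration:

k = 300
N = 1000000

samples = []
for index in range(N):
    # with a real stream, `index` would be whatever value just arrived
    reservoir_sampling(index, samples, k)

print(len(samples))   # 300
print(samples[:10])   # a few of the sampled indices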

Now, to give a qualitative feel, here are two histograms comparing reservoir sampling and uniform sampling for k=300 and N=1000000.