Sampling from a stream

I have long enjoyed programming brain teasers, at the same time I have been trying to find subjects that might get me blogging more consistently.  Hence when I saw a list of 140 Google Interview Questions, I felt inspired to start a series that explores answers to some of these questions.

Given my interest random numbers, I thought the following question would be a good way to kick things off:

There is a linked list of numbers of length N. N is very large and you don’t know N. You have to write a function that will return k random numbers from the list. Numbers should be completely random.  It should be done in O(n).

At first glance, I thought the solution would consist of simply traversing the linked list to find N, and then selecting k random numbers between 0 and N.  These numbers would then be the indices into the linked list.  However, after additional consideration I realized that this could be really wasteful as the linked list can’t be indexed, and the only way to retrieve the nodes of interest would be to retraverse the tree.  It also seems that if the linked list represented a stream of data, one might not have the luxury of traversing through the entire list.

Instead I discovered the answer lies in a surprisingly elegant algorithm called reservoir sampling.   For the first k draws, the items in the list are directly saved as the samples.  Thus if the stream stops at k, we have sampled uniformly across k.  The interesting part comes at the kth+1 element.  We need to determine how likely it was that this ith element would be sampled if we could have sampled directly from the entire range, in this case it is with probability i/k+1.   We use this probability to see if this index was selected, if so we should add it to our samples.  But now it appears as if we have too many (i.e. k+1) samples.  To remedy this randomly select an item from our existing reservoir of k samples and replace it with this ith value.  If you do the math, it turns out this is identical to sampling with replacement.

Here is the python code for this algorithm:

def reservoir_sampling(index, samples, k=1000) :
    if index < k :
        # for the first draws up to sample_size, k,
        # build a reservoir of samples directly from the data
        samples.append(index)
    else :
        # for the ith draw, calculate following probability for selection
        p_select_index = float(k)/(index+1)
        #if (numpy.random.random() <= p_select_index) :
        if (random.random() <= p_select_index) :
            # ith draw was selected, randomly select a sample to replace
            index_to_replace = random.randint(0,k)
            samples[index_to_replace] = index

Now to give a qualitative feel here are two histograms comparing reservoir sampling and uniform sampling for k=300 and N=1000000.

Leave a Reply