Random Sampling From a Large Text File

random sampling
large text files
algorithm performance

Random sampling from a set of entities implies that each entity has an equal chance of being selected compared to any other entity. Suppose we aim to randomly select \(k\) lines from a large text file containing hundreds of millions of lines. Our goal is to ensure that the probability of selection is identical for every line within the file.
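When the data fits comfortably in memory, Python's built-in random.sample already provides exactly this guarantee. The sketch below (using a small, made-up in-memory list) shows the behaviour we want to reproduce for a file that is far too large to load at once:

import random

# For in-memory data, random.sample picks k distinct items, each with equal
# probability. The algorithms below aim for the same guarantee without
# loading the whole file.
lines = [f"line {i}" for i in range(1000)]  # hypothetical in-memory data
sample = random.sample(lines, 10)
print(sample)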

Algorithm 1

The initial approach that comes to mind involves several steps:

  • Count the number of lines in the file.
  • Generate a sorted set of \(k\) distinct random integers between \(1\) and the number of lines.
  • Read the file line by line, keeping each line whose line number matches one of the random integers.

The implementation of this algorithm in Python is shown below:


import random

def random_sampler(filename, k):
    sample = []
    with open(filename, 'r') as f:
        # Count the number of lines in the file
        linecount = sum(1 for line in f)
        f.seek(0)  # Go back to the start of the file

        # Sort in descending order so that pop() yields line numbers in ascending order
        random_line_nums = sorted(random.sample(range(linecount), k), reverse=True)
        
        if random_line_nums:  # Check if there are any line numbers to process
            line_num = random_line_nums.pop()
            for n, line in enumerate(f):
                if n == line_num:
                    sample.append(line.rstrip())  # Strip newline characters from the end of the line
                    if random_line_nums:  # If there are more line numbers, continue; otherwise, break
                        line_num = random_line_nums.pop()
                    else:
                        break
    return sample
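As a quick usage sketch (the filename below is hypothetical), drawing 1,000 lines from a file looks like this:

sample = random_sampler('emails.txt', 1000)  # hypothetical file path
print(len(sample))  # 1000

The same call signature is used by the two alternatives that follow.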

Algorithm 2: Reservoir Sampling

Unlike the previous algorithm, which requires scanning the file twice, reservoir sampling selects a random sample in a single pass, without knowing the total number of items in advance.

While the basic concept of reservoir sampling existed earlier, it was Jeffrey Vitter who is credited with formalizing the algorithm in a way that’s widely cited today. This technique is particularly useful for sampling from a stream of data.

Here’s how reservoir sampling works:

  • To select \(k\) items from a set of items, we start by filling the “reservoir” with the first \(k\) items from the stream.
  • For every subsequent \(i^{th}\) item, we generate a random number \(r\) between 1 and \(i\).
  • If \(r \le k\), we replace the \(r^{th}\) item in the reservoir with the \(i^{th}\) item.
  • This process is continued for the entire stream of items.

The following Python code implements reservoir sampling:

import random

def random_sampler(filename, k):
    sample = []
    with open(filename) as f:
        for n, line in enumerate(f):
            if n < k:
                sample.append(line.rstrip())
            else:
                # Pick a random index in [0, n]; this line is the (n+1)-th item seen
                r = random.randint(0, n)
                if r < k:
                    sample[r] = line.rstrip()
    return sample
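A quick simulation also makes the uniformity claim easy to check empirically. The sketch below applies the same reservoir update to an in-memory stream of 20 items and counts how often each item ends up in a sample of size 5; reservoir_sample here is an illustrative helper, not part of the function above:

import random
from collections import Counter

def reservoir_sample(stream, k):
    # Same reservoir logic as above, applied to any iterable
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)
        else:
            r = random.randint(0, n)
            if r < k:
                sample[r] = item
    return sample

trials = 100000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(20), 5))

# Each of the 20 items should appear in roughly 5/20 = 25% of the trials
for item in sorted(counts):
    print(item, round(counts[item] / trials, 3))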

It is straightforward to demonstrate by induction that the reservoir sampling approach ensures each line has an equal probability of being selected:

Suppose we aim to collect a random sample of \(k\) items from a stream of items. We desire that after observing \(n\) items, each item in the sample set has a \(\frac{k}{n}\) chance of being included.

For example, consider \(k=10\). Initially, the first 10 items are added to the reservoir, each with a selection probability of \(\frac{10}{10} = 1 \checkmark\)

When the \(11^{th}\) item arrives, we want every item's chance of being in the sample to be \(\frac{10}{11}\). For this:

The \(11^{th}\) item is kept only when the random number \(r\), drawn uniformly between \(1\) and \(11\), falls between \(1\) and \(10\), so its selection probability is indeed \(\frac{10}{11} \checkmark\)

For items already in the reservoir, the probability of remaining in the sample after the \(11^{th}\) item has been considered equals their prior selection probability multiplied by the chance of not being replaced.

This calculation starts with the initial probability, which is \(\frac{10}{10}\). We then adjust this probability to account for the introduction of the \(11^{th}\) item. To do this, we subtract the chance that any one of the original 10 items gets replaced by the \(11^{th}\) item. This replacement probability is the product of two factors: the selection probability of the \(11^{th}\) item, which is \(\frac{10}{11}\), and the likelihood that any specific item out of the 10 is chosen for replacement, which is \(\frac{1}{10}\).

Therefore, the adjusted probability is calculated as: \(Pr = \frac{10}{10} \times (1 - \frac{10}{11} \times \frac{1}{10}) = \frac{10}{11} \checkmark\)

This reasoning applies similarly for the \(12^{th}\) item and subsequent items, ensuring each has an equal chance of selection as the dataset grows.
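Writing out the general inductive step makes the pattern explicit. Suppose each of the first \(n\) items is in the reservoir with probability \(\frac{k}{n}\). When the \((n+1)^{th}\) item arrives it is selected with probability \(\frac{k}{n+1}\), and an item already in the reservoir survives with probability

\(\frac{k}{n} \times \left(1 - \frac{k}{n+1} \times \frac{1}{k}\right) = \frac{k}{n} \times \frac{n}{n+1} = \frac{k}{n+1}\)

so the \(\frac{k}{n}\) invariant holds after \(n+1\) items as well.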

While reservoir sampling offers a fair and memory-efficient way to sample from large datasets or streams, it still reads every line and draws a random number for every line beyond the first \(k\), so it is not the fastest option when processing speed is a critical factor.

Context and Introduction to Algorithm 3

In 2013, while working on a project that required analyzing hundreds of millions of email addresses stored in a large text file, I faced the challenge of sampling this dataset efficiently. Given the impracticality of processing every line in a file of that size, and considering that email addresses tend to be fairly uniform in length, I devised a sampling strategy that uses the file's size in bytes as a proxy for the distribution of line positions. This method is particularly suited to datasets like mine, where line lengths are roughly uniform.

Algorithm 3: Sampling Based on File Size for Uniform Line Lengths

The need for an efficient and effective sampling method led to the following algorithm, which is especially applicable to large files whose lines are of roughly uniform length:

import random


def random_sampler(filename, k):
    sample = []
    with open(filename, 'rb') as f:
        f.seek(0, 2)  # Move to the end of the file
        file_size = f.tell()  # Get the file size

        # Generate a sorted set of k random positions within the file
        random_set = sorted(random.sample(range(file_size), k))

        for i in range(k):
            # Seek to a random byte position within the file
            f.seek(random_set[i])
            # Skip the remainder of the current line to avoid a partial line
            f.readline()
            # Read the next full line; if the position fell inside the last
            # line, wrap around and take the first line instead
            line = f.readline()
            if not line:
                f.seek(0)
                line = f.readline()
            # Decode from bytes and strip the trailing newline
            sample.append(line.rstrip().decode())

    return sample
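Because the fairness of this method rests on line lengths being roughly uniform, it can be worth probing that assumption before trusting the sample. The helper below is an illustrative sketch, not part of the sampler above: it reads a prefix of the file and reports the mean and standard deviation of line lengths (a standard deviation that is small relative to the mean suggests the assumption is reasonable).

import statistics

def line_length_spread(filename, probe_lines=100000):
    # Illustrative check of the uniform-line-length assumption: read the
    # first probe_lines lines and summarize how much their lengths vary
    lengths = []
    with open(filename, 'rb') as f:
        for n, line in enumerate(f):
            if n >= probe_lines:
                break
            lengths.append(len(line))
    return statistics.mean(lengths), statistics.pstdev(lengths)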

Benchmark

The following table shows the elapsed time required to select 1,000 lines, using each algorithm, from two files: a large one (approximately 40 million lines) and a very large one (approximately 300 million lines).

Algorithm            File 1 (~40M lines)    File 2 (~300M lines)
random_sampler1.py   6.64s                  74.18s
random_sampler2.py   50.41s                 411.08s
random_sampler3.py   0.02s                  3.12s

We observe that Algorithm 3 significantly outperforms the others in terms of speed. As previously noted, the primary assumption for this efficiency is that the lines within these files are of roughly uniform length.
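For reference, timings like these can be reproduced with a minimal harness along the following lines; the sampler and file names are placeholders, and absolute numbers will differ across hardware and data.

import time

def time_sampler(sampler, filename, k=1000):
    # Minimal timing harness: run one sampler and return the elapsed seconds
    start = time.perf_counter()
    sampler(filename, k)
    return time.perf_counter() - start

# Hypothetical usage, assuming the three implementations are in scope under
# distinct names:
# print(time_sampler(random_sampler_v3, 'emails.txt'))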