Random sampling from a set of entities means that each entity has the same chance of being selected as any other. Suppose we aim to randomly select k lines from a large text file.
Algorithm 1
The initial approach that comes to mind involves several steps:
- Count the number of lines in the file.
- Create a sorted random set of k integers between 0 and the number of lines in the file.
- Iterate over the random integers while reading the file line by line, and pick a line whenever its line number matches one of the random integers.
The implementation of this algorithm in Python is shown below:
import random

def random_sampler(filename, k):
    sample = []
    with open(filename, 'r') as f:
        # Count the number of lines in the file
        linecount = sum(1 for line in f)
        f.seek(0)  # Go back to the start of the file

        # Ensure the random line numbers are sorted in reverse order
        random_line_nums = sorted(random.sample(range(linecount), k), reverse=True)

        if random_line_nums:  # Check if there are any line numbers to process
            line_num = random_line_nums.pop()
            for n, line in enumerate(f):
                if n == line_num:
                    # Strip newline characters from the end of the line
                    sample.append(line.rstrip())
                    if random_line_nums:  # If there are more line numbers, continue; otherwise, break
                        line_num = random_line_nums.pop()
                    else:
                        break
    return sample
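As a quick usage sketch of the random_sampler defined above (the file name and sample size are invented for illustration):

    lines = random_sampler('emails.txt', 10)  # hypothetical input file
    print(len(lines))   # 10
    for line in lines:
        print(line)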
Algorithm 2: Reservoir Sampling
Unlike the previous algorithm that requires scanning the file twice, reservoir sampling allows us to select a random sample without knowing the total number of items in advance.
While the basic concept of reservoir sampling existed earlier, it was Jeffrey Vitter who is credited with formalizing the algorithm in a way that’s widely cited today. This technique is particularly useful for sampling from a stream of data.
Here’s how reservoir sampling works:
- To select k items from a set of n items, we start by filling the "reservoir" with the first k items from the stream.
- For every subsequent item, the i-th item in the stream, we generate a random number r between 1 and i.
- If r is within the range of the reservoir (i.e., less than or equal to k), we replace the r-th item in the reservoir with the i-th item.
- This process is continued for the entire stream of items (a generic sketch of these steps follows this list).
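Before turning to the file-based version, here is a minimal sketch of these steps over an arbitrary Python iterable; the function name and the 1-based indexing are my own choices for illustration, not part of the original code:

    import random

    def reservoir_sample(stream, k):
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                # Fill the reservoir with the first k items
                reservoir.append(item)
            else:
                # Draw a random position r between 1 and i
                r = random.randint(1, i)
                if r <= k:
                    # Replace the r-th reservoir item with the i-th stream item
                    reservoir[r - 1] = item
        return reservoir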
The following Python code implements reservoir sampling:
import random

def random_sampler(filename, k):
    sample = []
    with open(filename) as f:
        for n, line in enumerate(f):
            if n < k:
                sample.append(line.rstrip())
            else:
                # Generate a random index in the range from 0 to n (inclusive of n)
                r = random.randint(0, n)
                if r < k:
                    sample[r] = line.rstrip()
    return sample
It is straightforward to demonstrate by induction that the reservoir sampling approach ensures each line has an equal probability of being selected:

Suppose we aim to collect a random sample of size k, and assume that after processing n lines (with n >= k) each line seen so far is currently in the reservoir with probability k/n.

For example, consider what happens when the (n+1)-th line is read.

When the (n+1)-th line arrives, it is placed into the reservoir with probability k/(n+1), because the random index drawn from the n+1 possible values is smaller than k in exactly k cases.

For items already in the reservoir, the probability of remaining in the sample after the (n+1)-th line has been processed is the probability of not being displaced by it.

This calculation starts with the initial probability, which is k/n, and multiplies it by the probability of surviving this step, 1 - (k/(n+1)) * (1/k) = n/(n+1).

Therefore, the adjusted probability is calculated as:

(k/n) * (n/(n+1)) = k/(n+1)

This reasoning applies similarly for the (n+2)-th line and every line after it, so once the whole file has been read each line is in the sample with probability k/N, where N is the total number of lines.
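As an empirical sanity check rather than a proof, the following sketch (stream size, sample size, and trial count are made up) repeatedly runs the generic reservoir sampler from above and counts how often each item is selected; every observed frequency should come out close to k/n:

    import random
    from collections import Counter

    def reservoir_sample(stream, k):
        # Same generic sketch as above, repeated here so the check is self-contained
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)
            else:
                r = random.randint(1, i)
                if r <= k:
                    reservoir[r - 1] = item
        return reservoir

    n, k, trials = 20, 5, 100_000
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n), k))

    for item in range(n):
        # Each item should be selected in roughly k/n = 0.25 of the trials
        print(item, counts[item] / trials)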
While reservoir sampling offers a fair and memory-efficient approach for sampling from large datasets or streams, it may not be the fastest option for all scenarios, especially where processing speed is a critical factor.
Context and Introduction to Algorithm 3
In 2013, while working on a project that required analyzing hundreds of millions of email addresses stored in a large text file, I faced the challenge of efficiently sampling this dataset. Given the impracticality of processing every line in a file of such magnitude, and considering that email addresses tend to be relatively uniform in length, I devised a sampling strategy that utilizes the file’s size as a proxy for the distribution of line positions. This method is particularly suited for datasets like mine, where the data is uniformly distributed in terms of line length, allowing for an innovative approach to random sampling.
Algorithm 3: Sampling Based on File Size for Uniform Line Lengths
The need for an efficient and effective sampling method led to the development of the following algorithm, which is especially applicable to large files whose lines are of roughly uniform length:
import random

def random_sampler(filename, k):
    sample = []
    with open(filename, 'rb') as f:
        f.seek(0, 2)  # Move to the end of the file
        file_size = f.tell()  # Get the file size

        # Generate a sorted set of k random positions within the file
        random_set = sorted(random.sample(range(file_size), k))

        for i in range(k):
            # Seek to a random position
            f.seek(random_set[i])
            # Skip the current line to avoid partial lines
            f.readline()
            # Append the next line to the sample set
            sample.append(f.readline().rstrip())

    return sample
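As a rough usage sketch for the function above (the file name and contents are invented, and the fixed-length lines mimic the uniform-length assumption):

    # Build a small test file whose lines all have the same length
    with open('uniform_lines.txt', 'w') as out:
        for i in range(100_000):
            out.write(f'user{i:08d}@example.com\n')

    # random_sampler here is Algorithm 3 defined above; since the file is opened
    # in binary mode, the sampled lines come back as bytes objects
    print(random_sampler('uniform_lines.txt', 5))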
Benchmark
The following table illustrates the elapsed time required to select 1,000 lines from two distinct file sizes: a large file (approximately 40 million lines) and a very large file (approximately 300 million lines), using each algorithm.
| Algorithm | File 1 (~40M lines) | File 2 (~300M lines) |
|---|---|---|
| random_sampler1.py | 6.64s | 74.18s |
| random_sampler2.py | 50.41s | 411.08s |
| random_sampler3.py | 0.02s | 3.12s |
We observe that Algorithm 3 significantly outperforms the others in terms of speed. As previously noted, the primary assumption for this efficiency is that the lines within these files are of roughly uniform length.
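For readers who want to run a similar comparison on their own data, a minimal timing harness might look like the sketch below; the module names, input file name, and sample size are assumptions of mine and not part of the original benchmark:

    import time

    # Assumes the three scripts above are saved as random_sampler1.py,
    # random_sampler2.py, and random_sampler3.py, each defining random_sampler()
    from random_sampler1 import random_sampler as sampler1
    from random_sampler2 import random_sampler as sampler2
    from random_sampler3 import random_sampler as sampler3

    filename, k = 'large_file.txt', 1000  # hypothetical input file and sample size

    for name, sampler in [('random_sampler1', sampler1),
                          ('random_sampler2', sampler2),
                          ('random_sampler3', sampler3)]:
        start = time.perf_counter()
        sampler(filename, k)
        print(f'{name}: {time.perf_counter() - start:.2f}s')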