Returning a random line from a file given as input

Problem Statement:

Given a file as input, you have to return back a string representing the contents of one of the lines present in the file. The algorithm should be designed in a way such that probability of each line getting returned is same i.e. if the file has n lines, probability of each line getting returned as output would be \frac{1}{n} .

Constraints:

  • The file should be scanned only once.
  • There should be only constant extra space used.

Solution:

String getRandomLine(File file){
  String output = null;
  int linecount = 0;
  Random r = new Random();
  BufferredReader reader = new BufferedReader(new FileReader(file));
  String currentLineRead;
   while( (currentLineRead=reader.readLine())!=null ){
      linecount++;
      int rand = r.nextInt(linecount);
      if(rand+1 == linecount){
         output = currentLineRead;
      }
   }
}
return output;

Are you surprised that this simple method works? 😀

Explanation

The proof of correctness can be seen through mathematical induction on the loop invariant.

 Statement is that after seeing n lines, the probability of each line as output is \frac{1}{n}

Base case: linecount=1, then first line is stored as output as rand is 1 in first iteration. So, 1/1 = 1

Hypothesis: Assume, that it works correctly, for linecount=k, such that probability of k lines seen so far is \frac{1}{k} .

Induction step: we read k+1 th line and we get a random number between 0 to k. Probability of each random number being generated is \frac{1}{k+1} . So, the probability of current line (if random number generated is k) being stored as output becomes \frac{1}{k+1} . Let’s see what happens to probability of each of the previous k lines. The probability of getting a number in the range 0 to k-1 while generating a random number in this iteration is  \frac{k}{k+1} , and the probability of each of the previous k lines was \frac{1}{k} at end of last iteration, so now multiplying these two probabilities, the probability of each of those k previous lines becomes \frac{1}{k+1} .

An exercise

In this solution, we are generating random numbers in the range 0 to k each time. Suppose, we wanted to cut down the time taken to generate random number in each iteration. So, we generated a random decimal number outside this loop in the range [0,1) and inside the loop, we multiplied that by linecount to get an integer in the range [0, linecount). Will this still give each line in file an equal chance of being the output string?

2 thoughts on “Returning a random line from a file given as input”

Leave a Reply

Your email address will not be published. Required fields are marked *

Notify me of followup comments via e-mail. You can also subscribe without commenting.