Re: Selecting K random rows - efficiently!

"John D. Burger" <john@xxxxxxxxx> · Fri, 26 Oct 2007 00:01:38 -0400

As far as I can tell, all of the proposed solutions lack sample  
independence.  Take the OP's suggested approach of doing something  
like this:

  SELECT * FROM mydata
  WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
  ORDER BY mydata.random_number ASC LIMIT 100

All you're doing is picking random =subsequences= from the same  
permutation of the original data.  This is not the same as a random  
sample.  That is, if rows A and B are adjacent in the permutation,  
then if A is in the sample, B will also be in it with very high  
probability, depending on the size of the sample.  Another way of  
saying this is that the first element of the sample is selected  
randomly, the rest are completely deterministic.  In a true random  
sample, different elements are selected independently.

On the other hand, ORDER BY RANDOM() does indeed construct true  
random samples, because it makes a new permutation every time.  If  
you want to use the random_number column approach, then you need to  
do the same.  You can accomplish this by sampling from the original  
permutation repeatedly, doing the above with LIMIT 1 as many times as  
you need.  Yes this is more costly, but TANSTAAFL.

As is often observed, it's easy to create the appearance of  
randomness, harder to accomplish in reality.

- John Burger
  MITRE

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings