Search Postgresql Archives

Re: Selecting K random rows - efficiently!

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



As far as I can tell, all of the proposed solutions lack sample independence. Take the OP's suggested approach of doing something like this:

  SELECT * FROM mydata
  WHERE mydata.random_number >= (SELECT RANDOM() OFFSET 0)
  ORDER BY mydata.random_number ASC LIMIT 100

All you're doing is picking random =subsequences= from the same permutation of the original data. This is not the same as a random sample. That is, if rows A and B are adjacent in the permutation, then if A is in the sample, B will also be in it with very high probability, depending on the size of the sample. Another way of saying this is that the first element of the sample is selected randomly, the rest are completely deterministic. In a true random sample, different elements are selected independently.

On the other hand, ORDER BY RANDOM() does indeed construct true random samples, because it makes a new permutation every time. If you want to use the random_number column approach, then you need to do the same. You can accomplish this by sampling from the original permutation repeatedly, doing the above with LIMIT 1 as many times as you need. Yes this is more costly, but TANSTAAFL.

As is often observed, it's easy to create the appearance of randomness, harder to accomplish in reality.

- John Burger
  MITRE

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux