Search Postgresql Archives

Re: Selecting K random rows - efficiently!

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



cluster wrote:
It has been suggested [1] that a good way to select K random rows efficiently from a table is to
  1) add a new column, random_number, to the table
     and initialize it with random()
  2) perform the following query:
       SELECT *
       FROM mydata
       WHERE random_number >= (SELECT RANDOM() OFFSET 0)
       ORDER BY random_number ASC LIMIT <K>
     Here <K> is the number of random rows to pick. E.g. 100.

The benefit in this solution is that the "random_number" column can be indexed, allowing the query to be performed using a simple and fast index scan.

However, there is a couple of major drawbacks in this method:

1) The smaller "random_number" is, the less likely is it that it will be picked when using this method. Example: A random_number close to zero will only have a very small probability to be selected.
When the above query returns L rows (where L < K) then, you need to append the first K - L rows from the table to simulate a "ring" without start or end. (Conveniently, this also solves the problem of not finding K rows because the random start value was too large.)

The second set of rows can certainly be fetched using a second SELECT statement. Whether this can be computed efficiently as a single SELECT statement I am not sure but you might try something like this:

 (SELECT 1 AS seq, *
 FROM mydata
 WHERE random_number >= (SELECT RANDOM() OFFSET 0)
 ORDER BY random_number ASC LIMIT <K>)
UNION ALL
 (SELECT 2 AS seq, *
 FROM mydata
 ORDER BY random_number ASC LIMIT <K>)
ORDER BY seq ASC, random_number ASC LIMIT K;

This should provide each row with an equal chance of being selected while requiring the database to fetch at most 2 * K rows.

Regards,

Paul Tillotson

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux