On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote:
> Another way to look at the problem is: How do I sample a subset of size
> K efficiently? A query like
>
> SAMPLE 1000 OF
>   (SELECT * FROM mydata WHERE <some condition>)

How important is true randomness? To get the best possible distribution,
most algorithms require you either to know how many rows there are or to
scan the whole table (or index).

With some simplifying assumptions you can try extracting rows via an
index, with the caveat that if your index is unbalanced in any way, the
selection won't be truly "random" (a sketch of this follows at the end of
this message).

> should return 1000 random rows from the select statement so that two
> consecutive evaluations of the query would only with very little
> probability return the same 1000 rows.
> (Yes, I know that "SAMPLE 1000 OF" is not valid SQL)

Presumably your table is very much bigger than that, in which case I
suppose the not-entirely-random selection is unlikely to play much of a
role.

Search the archives; there have been solutions proposed before, though
they probably aren't very quick (see the second sketch below).

Have a nice day,
--
Martijn van Oosterhout <kleptog@xxxxxxxxx>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
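
To make the index idea concrete, here is a rough sketch of the usual
trick, assuming a hypothetical table mydata with an integer primary key
id (both names are illustrative, not from the original mail). If the ids
have gaps, the row just after each gap is picked more often than its
neighbours, which is exactly the "unbalanced index" caveat above:

    -- pick one roughly-random row using only the primary-key index;
    -- max(id) and the range scan are both satisfied from the index
    SELECT *
    FROM mydata
    WHERE id >= (SELECT (random() * max(id))::int FROM mydata)
    ORDER BY id
    LIMIT 1;

Repeating this 1000 times (discarding duplicates) gives an approximate
sample without reading most of the table.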
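
For comparison, the kind of solution that typically comes up in the
archives, again sketched against the same hypothetical mydata table: it
is uniformly random, but it evaluates random() for every qualifying row
and sorts them all, which is why it isn't quick on a big table:

    -- uniform sample, but reads and sorts every row matching the
    -- condition before keeping the first 1000
    SELECT *
    FROM mydata
    WHERE <some condition>
    ORDER BY random()
    LIMIT 1000;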