On Wed, Oct 24, 2007 at 10:59:46AM +0200, cluster wrote:
> Another way to look at the problem is: How do I sample a subset of size
> K efficiently? A query like
>
> SAMPLE 1000 OF
>   (SELECT * FROM mydata WHERE <some condition>)

How important is true randomness? To get the best possible distribution,
most algorithms require you either to know how many rows there are or to
scan the whole table (or index).

With some simplifying assumptions you can try extracting rows via an
index, with the caveat that if your index is unbalanced in any way, the
selection won't be truly "random" (a sketch of this follows at the end of
this message).

> should return 1000 random rows from the select statement so that two
> consecutive evaluations of the query would only with very little
> probability return the same 1000 rows.
> (Yes, I know that "SAMPLE 1000 OF" is not valid SQL)

Presumably your table is very much bigger than that, in which case I
suppose the not-entirely-random selection is unlikely to play much of a
role.

Search the archives; there have been solutions proposed before, though
they probably aren't very quick (see the second sketch below).

Have a nice day,
--
Martijn van Oosterhout <kleptog@xxxxxxxxx>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
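
To make the index idea concrete, here is a rough sketch of the usual
trick, assuming a hypothetical table mydata with an integer primary key
id (both names are illustrative, not from the original mail). If the ids
have gaps, the row just after each gap is picked more often than its
neighbours, which is exactly the "unbalanced index" caveat above:

    -- pick one roughly-random row using only the primary-key index;
    -- max(id) and the range scan are both satisfied from the index
    SELECT *
    FROM mydata
    WHERE id >= (SELECT (random() * max(id))::int FROM mydata)
    ORDER BY id
    LIMIT 1;

Repeating this 1000 times (discarding duplicates) gives an approximate
sample without reading most of the table.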
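
For comparison, the kind of solution that typically comes up in the
archives, again sketched against the same hypothetical mydata table: it
is uniformly random, but it evaluates random() for every qualifying row
and sorts them all, which is why it isn't quick on a big table:

    -- uniform sample, but reads and sorts every row matching the
    -- condition before keeping the first 1000
    SELECT *
    FROM mydata
    WHERE <some condition>
    ORDER BY random()
    LIMIT 1000;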