Re: Performance of ORDER BY RANDOM to select random rows?

Victor Hooi <victorhooi@xxxxxxxxx> · Mon, 12 Aug 2013 14:59:11 +1000

Hi,
@Hubert/@Sergey - Thanks for your response.

Hmm, aha, so the ORDER BY RANDOM behaviour hasn't changed - just to confirm - this means that Postgres will duplicate the table, add a new column, generate random numbers for every record, then sort by that new column, right?

I've just read the above anecdotally on the internet, but I'm curious if the actual implementation is documented somewhere officially apart from the source? Running the query through EXPLAIN didn't seem to tell me much additional information.

@Sergey - Thanks for the tip about using WITH RECURSIVE. I'm actually doing something similar in my application code in Django - basically take the max id, then generate a random integer between 0 and max id. However, it is dependent on how evenly distributed the record IDs are - in our case, if we delete a large number of records, it might affect things.

Are there any other database backends that do things differently?

(I know that SQL Server suggests using NEWID to do things - http://msdn.microsoft.com/en-us/library/cc441928.aspx).

Cheers,
Victor

On Fri, Aug 9, 2013 at 10:43 AM, Sergey Konoplev <gray.ru@xxxxxxxxx> wrote:

On Wed, Aug 7, 2013 at 7:01 PM, Victor Hooi <victorhooi@xxxxxxxxx> wrote:

> also seems to suggest that using ORDER BY RANDOM() will perform poorly on

> Postgres.

>

> I'm just wondering if this is still the case?

>

> I just ran those benchmarks on my system (Postgres 9.2.4), and using ORDERY

> BY RANDOM did not seem substantially to generating random integers in Python

> and picking those out (and handling non-existent rows).

>

> Has Postgres's behaviour for ORDER BY RANDOM change sometime recently?

Unfortunately, It has not. However, there always is a workaround. You

can get a random results fast by WITH RECURSIVE query.

WITH RECURSIVE r AS (

    WITH b AS (SELECT min(id), max(id) FROM table1)

    (

        SELECT id, min, max, array[]::integer[] AS a, 0 AS n

        FROM table1, b

        WHERE id > min + (max - min) * random()

        LIMIT 1

    ) UNION ALL (

        SELECT t.id, min, max, a || t.id, r.n + 1 AS n

        FROM table1 AS t, r

        WHERE

            t.id > min + (max - min) * random() AND

            t.id <> all(a) AND

            r.n + 1 < 10

        LIMIT 1

    )

)

SELECT t.id FROM table1 AS t, r WHERE r.id = t.id;

The general idea is that we get a random value between min(id) and

max(id) and then get the first row with id bigger than this value.

Then we repeat until we get 10 of such rows, checking that this id has

not been retrieved earlier.

Surely, the probability of appearing one or another value in the

result depends on the distribution of id values in the table, but in

the most cases I faced it works good.

I had an idea to play with pg_stats.histogram_bounds to work around

the described issue, but it was never so critical for tasks I solved.

--

Kind regards,

Sergey Konoplev

PostgreSQL Consultant and DBA

http://www.linkedin.com/in/grayhemp

+1 (415) 867-9984, +7 (901) 903-0499, +7 (988) 888-1979

gray.ru@xxxxxxxxx