Re: Recursive CTEs and randomness - is there something I'm missing?

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 28 Feb 2020 12:45:29 -0500

=?UTF-8?B?UMOzbCBVYSBMYW/DrW5lY2jDoWlu?= <linehanp@xxxxxx> writes:
> The SQL:

> WITH RECURSIVE rand (num, md, a_2_s) AS
> (
>   SELECT
>     1,
>     MD5(RANDOM()::TEXT),
>     ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>                           FROM GENERATE_SERIES(1, 5)), '')
>   UNION
>     SELECT num + 1,
>     MD5(RANDOM()::TEXT),
>     ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
>                           FROM GENERATE_SERIES(1, 5)), '')
>   FROM rand
>   WHERE num < 5
> )
> SELECT * FROM rand;

> A typical result is shown below:

> 1  974ee059a1902e5ca1ec73c91275984b     GYXYS
> 2  6cf5a974d5859eae23cdb9c310e3a3bf       YFDPT
> 3  fa6be95eb720fe6f80c7c8fb6ba11171         YFDPT
> 4  fa54913b0bb43de0025b153fd71a5334      YFDPT
> 5  523fab9bdc6c4c51a89e0d901273fb69       YFDPT

> The fact that the last 4 are identical is not a coincidence. If I put
> 100 in the GENERATE_SERIES, I still get the same result, the first and
> second records are different, but ALL subsequent instances of the
> ARRAY_TO_STRING are identical!

Yeah, it's weird.  A look at EXPLAIN VERBOSE offers some insight:

                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 CTE Scan on rand  (cost=4.75..5.37 rows=31 width=68)
   Output: rand.num, rand.md, rand.a_2_s
   CTE rand
     ->  Recursive Union  (cost=0.13..4.75 rows=31 width=68)
           ->  Result  (cost=0.13..0.15 rows=1 width=68)
                 Output: 1, md5((random())::text), array_to_string($1, ''::text)
                 InitPlan 1 (returns $1)
                   ->  Function Scan on pg_catalog.generate_series  (cost=0.00..0.13 rows=5 width=32)
                         Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
                         Function Call: generate_series(1, 5)
           ->  WorkTable Scan on rand rand_1  (cost=0.13..0.40 rows=3 width=68)
                 Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
                 Filter: (rand_1.num < 5)
                 InitPlan 2 (returns $2)
                   ->  Function Scan on pg_catalog.generate_series generate_series_1  (cost=0.00..0.13 rows=5 width=32)
                         Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
                         Function Call: generate_series(1, 5)
(17 rows)

The ARRAY sub-selects are being done as initplans, not subplans,
which means they're only done once not once per row.  This is correct
so far as the planner is concerned because those sub-selects are
"uncorrelated", ie they use no values from the outer query.

There is room to argue that because the sub-selects contain volatile
functions, they ought not be optimized into initplans.  We have
traditionally not considered that, however, and I'm afraid that a
lot of people have written queries that depend on it.  For example,
there's lore out there that replacing
	WHERE mycol < random()
with
	WHERE mycol < (SELECT random())
will freeze the random() result as a single value rather than
computing a new value for each row, which sometimes is what you
need.  These days, better practice would be to put the random()
call in a CTE, but there's still a lot of code out there that
does it as above.

For your immediate problem, since you don't care that much
(I suppose) about exactly how the strings are generated, you
could fix the issue by making the sub-selects depend on
"num" somehow.  Or possibly there's a way to do it without
a sub-select.  On the whole this looks like a mighty expensive
way to generate a random string, so I'd be inclined to look
for other implementations.

			regards, tom lane