Re: Recursive CTEs and randomness - is there something I'm missing?

On 2/28/20 9:45 AM, Tom Lane wrote:
Pól Ua Laoínecháin <linehanp@xxxxxx> writes:
The SQL:

WITH RECURSIVE rand (num, md, a_2_s) AS
(
   SELECT
     1,
     MD5(RANDOM()::TEXT),
     ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
                           FROM GENERATE_SERIES(1, 5)), '')
   UNION
     SELECT num + 1,
     MD5(RANDOM()::TEXT),
     ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
                           FROM GENERATE_SERIES(1, 5)), '')
   FROM rand
   WHERE num < 5
)
SELECT * FROM rand;

A typical result is shown below:

1  974ee059a1902e5ca1ec73c91275984b  GYXYS
2  6cf5a974d5859eae23cdb9c310e3a3bf  YFDPT
3  fa6be95eb720fe6f80c7c8fb6ba11171  YFDPT
4  fa54913b0bb43de0025b153fd71a5334  YFDPT
5  523fab9bdc6c4c51a89e0d901273fb69  YFDPT

The fact that the last 4 are identical is not a coincidence. If I put
100 in the GENERATE_SERIES, I still get the same result: the first and
second records differ, but ALL subsequent instances of the
ARRAY_TO_STRING are identical!

Yeah, it's weird.  A look at EXPLAIN VERBOSE offers some insight:

                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
  CTE Scan on rand  (cost=4.75..5.37 rows=31 width=68)
    Output: rand.num, rand.md, rand.a_2_s
    CTE rand
      ->  Recursive Union  (cost=0.13..4.75 rows=31 width=68)
            ->  Result  (cost=0.13..0.15 rows=1 width=68)
                  Output: 1, md5((random())::text), array_to_string($1, ''::text)
                  InitPlan 1 (returns $1)
                    ->  Function Scan on pg_catalog.generate_series  (cost=0.00..0.13 rows=5 width=32)
                          Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
                          Function Call: generate_series(1, 5)
            ->  WorkTable Scan on rand rand_1  (cost=0.13..0.40 rows=3 width=68)
                  Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
                  Filter: (rand_1.num < 5)
                  InitPlan 2 (returns $2)
                    ->  Function Scan on pg_catalog.generate_series generate_series_1  (cost=0.00..0.13 rows=5 width=32)
                          Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
                          Function Call: generate_series(1, 5)
(17 rows)

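The ARRAY sub-selects are being done as initplans, not subplans,
which means each is evaluated only once rather than once per row.
That is why row 1 (InitPlan 1, the non-recursive term) differs from
rows 2 onward, which all share the single result of InitPlan 2.  This
is correct so far as the planner is concerned because those sub-selects
are "uncorrelated", i.e. they use no values from the outer query.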
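
To make the once-versus-per-row distinction concrete, here is a toy
Python model. It is only an analogy, not how the PostgreSQL executor
is implemented; the names random_string, run_as_initplan and
run_as_subplan are made up for illustration.

```python
import random
import string


def random_string(rng, length=5):
    """Stand-in for the ARRAY(SELECT chr(...)) sub-select: random capitals."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def run_as_initplan(rng, nums):
    """Uncorrelated sub-select: evaluated once, result reused for every row."""
    cached = random_string(rng)  # plays the role of $2 in the plan above
    return [(num, cached) for num in nums]


def run_as_subplan(rng, nums):
    """Correlated sub-select: re-evaluated for each outer row."""
    return [(num, random_string(rng)) for num in nums]


if __name__ == "__main__":
    rng = random.Random()
    print("initplan-like:")
    for row in run_as_initplan(rng, range(2, 6)):
        print(row)
    print("subplan-like:")
    for row in run_as_subplan(rng, range(2, 6)):
        print(row)
```

The first loop prints the same string on every row, matching rows 2
to 5 of the reported output; the second draws a fresh string per row,
which is the behaviour the original query was expecting.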

There is room to argue that because the sub-selects contain volatile
functions, they ought not be optimized into initplans.  We have
traditionally not considered that, however, and I'm afraid that a
lot of people have written queries that depend on it.  For example,
there's lore out there that replacing
	WHERE mycol < random()
with
	WHERE mycol < (SELECT random())
will freeze the random() result as a single value rather than
computing a new value for each row, which sometimes is what you
need.  These days, better practice would be to put the random()
call in a CTE, but there's still a lot of code out there that
does it as above.

For your immediate problem, since you don't care that much
(I suppose) about exactly how the strings are generated, you
could fix the issue by making the sub-selects depend on
"num" somehow.  Or possibly there's a way to do it without
a sub-select.  On the whole this looks like a mighty expensive
way to generate a random string, so I'd be inclined to look
for other implementations.

An off-the-cuff Python solution:

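CREATE OR REPLACE FUNCTION public.upper_random()
 RETURNS character varying
 LANGUAGE plpython3u  -- was plpythonu (Python 2 only, removed in PostgreSQL 15)
AS $function$
# Build a 5-letter string of distinct uppercase ASCII letters.
from string import ascii_uppercase as au
import random
# random.sample draws without replacement, so no letter repeats.
return ''.join(random.sample(au, 5))
$function$;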

SELECT
    num,
    MD5(RANDOM()::TEXT),
    upper_random()
FROM
    generate_series(1, 5) AS t(num);

 num |               md5                | upper_random
-----+----------------------------------+--------------
   1 | 5896a9e3efa53027873d7999e58904ae | TRGEB
   2 | 9f9677c32a64b6eae73759a69e1acfff | TFQHG
   3 | 5aefda5b498215065e01ba697d79caee | KYBZS
   4 | 1605b7fa54fef9bdc5c49f9b79810e07 | EUDML
   5 | 4ba59880c1c67bca1d1f184bda5350b6 | RFPGO
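
One caveat about the function above: random.sample draws without
replacement, so a letter never appears twice in one string, whereas
the original chr(65 + ...) expression can repeat letters. If repeats
should be allowed, random.choices is the closer match. A plain-Python
sketch of both, runnable outside PL/Python (the two function names
are mine, not part of any library):

```python
import random
from string import ascii_uppercase


def upper_random_no_repeats(length=5):
    """Same approach as upper_random(): sample without replacement."""
    return "".join(random.sample(ascii_uppercase, length))


def upper_random_with_repeats(length=5):
    """Variant using random.choices, which samples with replacement."""
    return "".join(random.choices(ascii_uppercase, k=length))


if __name__ == "__main__":
    print(upper_random_no_repeats())
    print(upper_random_with_repeats())
```

The sample-based version is also limited to 26 characters, since it
cannot draw more letters than the alphabet contains; the choices-based
version has no such limit.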




			regards, tom lane




--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx




