On 2/28/20 9:45 AM, Tom Lane wrote:
=?UTF-8?B?UMOzbCBVYSBMYW/DrW5lY2jDoWlu?= <linehanp@xxxxxx> writes:
The SQL:
WITH RECURSIVE rand (num, md, a_2_s) AS
(
SELECT
1,
MD5(RANDOM()::TEXT),
ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
FROM GENERATE_SERIES(1, 5)), '')
UNION
SELECT num + 1,
MD5(RANDOM()::TEXT),
ARRAY_TO_STRING(ARRAY(SELECT chr((65 + round(random() * 25)) :: integer)
FROM GENERATE_SERIES(1, 5)), '')
FROM rand
WHERE num < 5
)
SELECT * FROM rand;
A typical result is shown below:
1 974ee059a1902e5ca1ec73c91275984b GYXYS
2 6cf5a974d5859eae23cdb9c310e3a3bf YFDPT
3 fa6be95eb720fe6f80c7c8fb6ba11171 YFDPT
4 fa54913b0bb43de0025b153fd71a5334 YFDPT
5 523fab9bdc6c4c51a89e0d901273fb69 YFDPT
The fact that the last 4 are identical is not a coincidence. If I put
100 in the GENERATE_SERIES, I still get the same result, the first and
second records are different, but ALL subsequent instances of the
ARRAY_TO_STRING are identical!
Yeah, it's weird. A look at EXPLAIN VERBOSE offers some insight:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
CTE Scan on rand (cost=4.75..5.37 rows=31 width=68)
Output: rand.num, rand.md, rand.a_2_s
CTE rand
-> Recursive Union (cost=0.13..4.75 rows=31 width=68)
-> Result (cost=0.13..0.15 rows=1 width=68)
Output: 1, md5((random())::text), array_to_string($1, ''::text)
InitPlan 1 (returns $1)
-> Function Scan on pg_catalog.generate_series (cost=0.00..0.13 rows=5 width=32)
Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
Function Call: generate_series(1, 5)
-> WorkTable Scan on rand rand_1 (cost=0.13..0.40 rows=3 width=68)
Output: (rand_1.num + 1), md5((random())::text), array_to_string($2, ''::text)
Filter: (rand_1.num < 5)
InitPlan 2 (returns $2)
-> Function Scan on pg_catalog.generate_series generate_series_1 (cost=0.00..0.13 rows=5 width=32)
Output: chr((('65'::double precision + round((random() * '25'::double precision))))::integer)
Function Call: generate_series(1, 5)
(17 rows)
The ARRAY sub-selects are being done as initplans, not subplans,
which means they're only done once not once per row. This is correct
so far as the planner is concerned because those sub-selects are
"uncorrelated", ie they use no values from the outer query.
There is room to argue that because the sub-selects contain volatile
functions, they ought not be optimized into initplans. We have
traditionally not considered that, however, and I'm afraid that a
lot of people have written queries that depend on it. For example,
there's lore out there that replacing
WHERE mycol < random()
with
WHERE mycol < (SELECT random())
will freeze the random() result as a single value rather than
computing a new value for each row, which sometimes is what you
need. These days, better practice would be to put the random()
call in a CTE, but there's still a lot of code out there that
does it as above.
For your immediate problem, since you don't care that much
(I suppose) about exactly how the strings are generated, you
could fix the issue by making the sub-selects depend on
"num" somehow. Or possibly there's a way to do it without
a sub-select. On the whole this looks like a mighty expensive
way to generate a random string, so I'd be inclined to look
for other implementations.
An off the cuff Python solution:
CREATE OR REPLACE FUNCTION public.upper_random()
RETURNS character varying
LANGUAGE plpythonu
AS $function$
from string import ascii_uppercase as au
import random
return ''.join(random.sample(au,5))
$function$
SELECT
num,
MD5(RANDOM()::TEXT),
upper_random()
FROM
generate_series(1, 5) AS t(num);
num | md5 | upper_random
-----+----------------------------------+--------------
1 | 5896a9e3efa53027873d7999e58904ae | TRGEB
2 | 9f9677c32a64b6eae73759a69e1acfff | TFQHG
3 | 5aefda5b498215065e01ba697d79caee | KYBZS
4 | 1605b7fa54fef9bdc5c49f9b79810e07 | EUDML
5 | 4ba59880c1c67bca1d1f184bda5350b6 | RFPGO
regards, tom lane
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx