Re: Query became very slow after 9.6 -> 10 upgrade

Dmitry Shalashov <skaurus@xxxxxxxxx> · Thu, 23 Nov 2017 17:58:23 +0300

We tried to apply the patch on 10.1 source, but something is wrong it seems:
patch -p1 < ../1.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/optimizer/plan/analyzejoins.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/utils/adt/selfuncs.c
Hunk #1 succeeded at 3270 (offset -91 lines).
Hunk #2 succeeded at 3304 (offset -91 lines).
Hunk #3 succeeded at 3313 (offset -91 lines).
Hunk #4 succeeded at 3393 (offset -91 lines).
patch unexpectedly ends in middle of line
Hunk #5 succeeded at 3570 with fuzz 1 (offset -91 lines).

Dmitry Shalashov, relap.io & surfingbird.ru

2017-11-23 2:07 GMT+03:00 Tom Lane <tgl@xxxxxxxxxxxxx>:
Dmitry Shalashov <skaurus@xxxxxxxxx> writes:

> Turns out we had not 9.6 but 9.5.

I'd managed to reproduce the weird planner behavior locally in the

regression database:

regression=# create table foo (f1 int[], f2 int);

CREATE TABLE

regression=# explain select * from tenk1 where unique2 in (select distinct unnest(f1) from foo where f2=1);

                                    QUERY PLAN

-----------------------------------------------------------------------------------

 Nested Loop  (cost=30.85..80.50 rows=6 width=244)

   ->  HashAggregate  (cost=30.57..30.63 rows=6 width=4)

         Group Key: (unnest(foo.f1))

         ->  HashAggregate  (cost=30.42..30.49 rows=6 width=4)

               Group Key: unnest(foo.f1)

               ->  ProjectSet  (cost=0.00..28.92 rows=600 width=4)

                     ->  Seq Scan on foo  (cost=0.00..25.88 rows=6 width=32)

                           Filter: (f2 = 1)

   ->  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..8.30 rows=1 width=244)

         Index Cond: (unique2 = (unnest(foo.f1)))

(10 rows)

Digging into it, the reason for the duplicate HashAggregate step was that

query_supports_distinctness() punted on SRFs-in-the-targetlist, basically

on the argument that it wasn't worth extra work to handle that case.

Thinking a bit harder, it seems to me that the correct analysis is:

1. If we are proving distinctness on the grounds of a DISTINCT clause,

then it doesn't matter whether there are any SRFs, because DISTINCT

removes duplicates after tlist SRF expansion.

2. But tlist SRFs break the ability to prove distinctness on the grounds

of GROUP BY, unless all of them are within grouping columns.

It still seems like detecting the second case is harder than it's worth,

but we can trivially handle the first case, with little more than some

code rearrangement.

The other problem is that the output rowcount of the sub-select (ie, of

the HashAggregate) is being estimated as though the SRF weren't there.

This turns out to be because estimate_num_groups() doesn't consider the

possibility of SRFs in the grouping columns.  It never has, but in 9.6 and

before the problem was masked by the fact that grouping_planner scaled up

the result rowcount by tlist_returns_set_rows() *after* performing

grouping.  Now we're effectively doing that in the other order, which is

more correct, but that means estimate_num_groups() has to apply some sort

of adjustment.  I suggest that it just multiply its old estimate by the

maximum of the SRF expansion counts.  That's likely to be an overestimate,

but it's really hard to do better without specific knowledge of the

individual SRF's behavior.

In short, I propose the attached fixes.  I've checked this and it seems

to fix Dmitry's original problem according to the test case he sent

off-list.

                        regards, tom lane