>>>>> "Jung" == Jung, Jinho <jinho.jung@xxxxxxxxxx> writes: Jung> select distinct Jung> ref_0.i_im_id as c0, Jung> ref_1.ol_dist_info as c1 Jung> from Jung> public.item as ref_0 right join Jung> public.order_line as ref_1 Jung> on (ref_0.i_id = 5) Jung> - Commit: 84f9a35 (Improve estimate of distinct values in estimate_num_groups()) Jung> - Our analysis: We believe that this regression is related to the Jung> new logic for estimating the number of distinct values in the Jung> optimizer. This is affecting even queries with point lookups Jung> (ref_0.i_id = 5) in the TPC-C benchmark. So what's happening here is that the old plan was mis-estimating the result, believed incorrectly that it would fit into work_mem, and generated a hashaggregate plan accordingly; it ran fast because hashaggregate doesn't spill to disk but silently overflows work_mem. The new plan correctly estimates the result size, and therefore is forbidden from generating the hashaggregate plan at the default work_mem setting; it generates a sort plan, and the sort of course spills to disk since work_mem is exceeded. Had the value of work_mem been set to something appropriate for the workload, then the query plan would not have changed. So the problem (from an automated testing perspective) is that an actual _improvement_ in the code is being reported as a regression. Jung> ####### QUERY 3: Jung> select Jung> cast(ref_1.ol_i_id as int4) as c0 Jung> from Jung> public.stock as ref_0 Jung> left join public.order_line as ref_1 Jung> on (ref_1.ol_number is not null) Jung> where ref_1.ol_number is null Jung> - Commit: 77cd477 (Enable parallel query by default.) Jung> - Our analysis: We believe that this regression is due to Jung> parallel queries being enabled by default. Surprisingly, we found Jung> that even on a larger TPC-C database (scale factor of 50, roughly Jung> 4GB), parallel scan is still slower than the non-parallel one in Jung> the old version, when the query is not returning any tuples. The problem here is not actually with parallel scans as such, but rather the omission of a Materialize node in the parallel plan, and what looks like some rather serious mis-costing of the nestloop antijoin. Jung> ####### QUERY 4: Jung> select Jung> ref_0.s_dist_06 as c0 Jung> from Jung> public.stock as ref_0 Jung> where (ref_0.s_w_id < cast(least(0, 1) as int8)) Jung> - Commit: 5edc63b (Account for the effect of lossy pages when costing bitmap scans) Jung> - Our analysis: We believe that this regression has to do with Jung> two factors: 1) conditional expression (e.g., LEAST or NULLIF) Jung> are not reduced to constants unlike string functions (e.g., Jung> CHAR_LENGTH) 2) change in the cost estimation function for bitmap Jung> scan. Execution time grows by 3 orders of magnitude. We note that Jung> this regression is only observed on large databases (e.g., scale Jung> factor of 50). Again, this is showing up because of a large database and a small work_mem. The bitmap scan on stock only becomes lossy if the number of rows matched in the index is very large relative to work_mem; the lack of plan-time evaluation of LEAST means that the planner doesn't have any good way to estimate the selectivity, so it's taking a default estimate. 
 Jung> ####### QUERY 1:

 Jung> select
 Jung>   ref_0.o_d_id as c0
 Jung> from
 Jung>   public.oorder as ref_0
 Jung> where EXISTS (
 Jung>   select
 Jung>     1
 Jung>   from
 Jung>     (select distinct
 Jung>        ref_0.o_entry_d as c0,
 Jung>        ref_1.c_credit as c1
 Jung>      from
 Jung>        public.customer as ref_1
 Jung>      where (false)
 Jung>     ) as subq_1
 Jung> );

 Jung> - Commit: bf6c614 (Do execGrouping.c via expression eval
 Jung>   machinery, take two)

 Jung> - Our analysis: We are not sure about the root cause of this
 Jung>   regression. This might have to do with grouping logic.

What this query is basically exercising is how fast one can do
ExecReScan on a DISTINCT query, without also considering the
performance effects of actually doing the grouping (the constant-false
qual here means that the grouping comparison is never actually
performed).

An optimization tradeoff that speeds up comparisons within a scan at
the cost of a fixed overhead for the scan will therefore make this
query slower, but it still seems a good tradeoff to make. (Of course
it would be even better to make the overhead per-query rather than
per-scan, and there were other issues with this commit that should
have been caught at the time.)

-- 
Andrew (irc:RhodiumToad)
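PS: for anyone who wants to see what is actually being measured here,
something along these lines makes it visible (just a sketch; the exact
plan shape and the loop counts will depend on the size of oorder):

  EXPLAIN (ANALYZE, TIMING OFF)
  select ref_0.o_d_id as c0
  from public.oorder as ref_0
  where EXISTS (
    select 1
    from (select distinct
            ref_0.o_entry_d as c0,
            ref_1.c_credit as c1
          from public.customer as ref_1
          where (false)) as subq_1
  );

The loops count on the nodes under the SubPlan shows one rescan per
oorder row, while the constant-false qual means that no rows ever
reach the grouping comparison itself.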