Re: Performance issue post upgrade on Version 13 - Incorrect Estimation Cost choosing Hash Aggregate-Nested Left Loop Join

On 3/22/22 13:57, Prajna Shetty wrote:
> ++ _pgsql-performance@postgresql.org_
> <https://www.postgresql.org/list/pgsql-performance/>
>  
> Hello Team,
>  
> There is a change in the query plan between version 12.4 and version 13,
> resulting in a performance slowdown after the upgrade.
>  
> 
>   * In version 12.4, a sorted GroupAggregate is selected, which leads
>     to a Merge Join. The query takes ~5 seconds.
>   * In version 13.5, the optimizer misestimates and, due to the new
>     disk-based Hash Aggregate feature, prefers a Hash Aggregate over the
>     sort. That rules out the merge join, and a Nested Loop Left Join is
>     chosen instead. The query takes ~5 minutes.
> 
>  
> *_NOTE:_* Disabling Hash Aggregate at the instance level forces the
> optimizer to choose the merge operation, but such an instance-level
> modification is not possible in terms of application functionality.
>  
> This performance issue affects most of our queries. Attached is one of
> the queries and its plan in both versions, in case that helps with
> recreating the issue.
>  

It's impossible to comment on those other queries, but chances are the root
cause is the same.

> The version 13 query plan has a lower estimated cost than that of 12.4,
> which implies the 13.5 planner thought it found a better plan, but it
> runs slower and the actual times are higher.
>  
> 12.4 Version:
> "Merge Right Join  (cost=*202198.78..295729.10* rows=1 width=8) (actual
> time=1399.727..*5224.574* rows=296 loops=1)"
>  
> 13.5 version:
> "Nested Loop Left Join  (cost=*196360.90..287890.45* rows=1 width=8)
> (actual time=3209.577..*371300.693* rows=296 loops=1)"
>  

This is not a costing issue; the problem is that we expect 1 row and
calculate the cost for that, but then get 296. And unfortunately a
nested loop degrades much faster than a merge join, because its inner
side is re-executed for every outer row.

I'm not sure why exactly 12.4 picked a merge join; chances are the
costing formula changed a bit somewhere. But as I said, the problem is
the bogus row cardinality estimates - 12.4 is simply lucky.
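To confirm that on a single query, without the instance-wide change
mentioned in the NOTE above, the planner method settings can be toggled
per session. A minimal diagnostic sketch (enable_hashagg and
enable_nestloop are standard planner GUCs; the query itself is a
placeholder):

  BEGIN;
  SET LOCAL enable_hashagg = off;   -- fall back to the sorted GroupAggregate
  SET LOCAL enable_nestloop = off;  -- discourage the nested loop join
  EXPLAIN (ANALYZE, BUFFERS) <the affected query>;
  ROLLBACK;                         -- SET LOCAL only lasts until the transaction ends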

The problem most likely stems from this part:

  ->  GroupAggregate  (cost=0.43..85743.24 rows=1830 width=72) (actual
time=1.621..3452.034 rows=282179 loops=3)
  Group Key: student_class_detail.aamc_id
  Filter: (max((student_class_detail.class_level_cd)::text) = '4'::text)
  Rows Removed by Filter: 76060
  ->  Index Scan using uk_student_class_detail_aamcid_classlevelcd on
student_class_detail  (cost=0.43..74747.61 rows=1284079 width=6) (actual
time=1.570..2723.014 rows=1272390 loops=3)
    Filter: (class_level_start_dt IS NOT NULL)
    Rows Removed by Filter: 160402

The filter is bound to be misestimated, and the error then snowballs.
Try replacing this part with a temporary table (with pre-aggregated
results) - you can run analyze on it, etc. I'd bet that'll make the
issue go away.
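A minimal sketch of that pre-aggregation idea, using the table and column
names visible in the plan above (the temp table name is made up; adjust
the rest to the actual query):

  -- hypothetical temp table holding the pre-aggregated results
  CREATE TEMP TABLE student_class_max AS
    SELECT aamc_id,
           max(class_level_cd::text) AS max_class_level_cd
      FROM student_class_detail
     WHERE class_level_start_dt IS NOT NULL
     GROUP BY aamc_id;

  ANALYZE student_class_max;

  -- then join the main query against student_class_max and filter on
  -- max_class_level_cd = '4' instead of the GroupAggregate + Filter above

With real statistics on the aggregated rows, the join above it should be
estimated (and planned) much more sensibly.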

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




