Re: Potential performance issues related to group by and covering index

Pavel Stehule <pavel.stehule@xxxxxxxxx> · Tue, 2 Mar 2021 10:08:47 +0100

út 2. 3. 2021 v 9:53 odesílatel Liu, Xinyu <liuxy@xxxxxxxxxx> napsal:

Hello,

We have 2 TPC-H queries which fetch the same tuples but have significant query execution time differences (4.3
 times).

We are sharing a pair of TPC-H queries that exhibit this performance difference:

First query:
SELECT "ps_comment", 
       "ps_suppkey", 
       "ps_supplycost", 
       "ps_partkey", 
       "ps_availqty" 
FROM  
"partsupp" 
WHERE 
"ps_partkey" + 16
< 1 
        OR "ps_partkey" = 2 
GROUP 
BY "ps_partkey", 
          "ps_suppkey", 
          "ps_availqty", 
          "ps_supplycost", 
          "ps_comment" 

Second query:
SELECT
"ps_comment", 
       "ps_suppkey", 
       "ps_supplycost", 
       "ps_partkey", 
       "ps_availqty" 
FROM  
"partsupp" 
WHERE 
"ps_partkey" + 16
< 1 
        OR
"ps_partkey" = 2 
GROUP  BY
"ps_comment", 
          "ps_suppkey", 
          "ps_supplycost", 
          "ps_partkey", 
          "ps_availqty" 

* Actual Behavior
We executed both queries on the TPC-H benchmark of scale factor 5: the first query takes
 over 1.7 seconds, while the second query only takes 0.4 seconds. 

We think the time difference results from different plans selected. Specifically, in the first (slow) query, the DBMS performs an index scan on table partsupp using the covering index (ps_partkey, ps_suppkey), while the second (fast) query performs a parallel
 scan on (ps_suppkey, ps_partkey).

*  Query Execution Plan

First query:

                                                       QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=0.43..342188.58 rows=399262 width=144) (actual time=0.058..1737.659 rows=4
 loops=1)
   Group Key: ps_partkey, ps_suppkey
   Buffers: shared hit=123005 read=98055
   ->  Index Scan using partsupp_pkey on partsupp  (cost=0.43..335522.75 rows=1333167
 width=144) (actual time=0.055..1737.651 rows=4 loops=1)
         Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))
         Rows Removed by Filter: 3999996
         Buffers: shared hit=123005 read=98055
 Planning Time: 0.926 ms
 Execution Time: 1737.754 ms
(9 rows)

In this case there is brutal overestimation. Probably due  planner unfriendly written predicate ps_partkey + 16  < 1) OR ps_parkey = 2. You can try to rewrite this predicate to ps_parthkey < -15 OR ps_parkey = 2

Regards

Pavel

Second query:

                                                                                                                                          QUERY
 PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=250110.68..350438.93 rows=399262 width=144) (actual time=400.353..400.361
 rows=4 loops=1)
   Group Key: ps_suppkey, ps_partkey
   Buffers: shared hit=5481 read=24093
   ->  Gather Merge  (cost=250110.68..346446.31 rows=798524 width=144) (actual time=400.351..406.741
 rows=4 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=15151 read=72144
         ->  Group  (cost=249110.66..253276.80 rows=399262 width=144) (actual time=395.882..395.883
 rows=1 loops=3)
               Group Key: ps_suppkey, ps_partkey
               Buffers: shared hit=15151 read=72144
               ->  Sort  (cost=249110.66..250499.37 rows=555486 width=144) (actual time=395.880..395.881
 rows=1 loops=3)
                     Sort Key: ps_suppkey, ps_partkey
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 25kB
                     Worker 1:  Sort Method: quicksort  Memory: 25kB
                     Buffers: shared hit=15151 read=72144
                     ->  Parallel Seq Scan on partsupp  (cost=0.00..116363.88 rows=555486
 width=144) (actual time=395.518..395.615 rows=1 loops=3)
                           Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))
                           Rows Removed by Filter: 1333332
                           Buffers: shared hit=15065 read=72136
 Planning Time: 0.360 ms
 Execution Time: 406.880 ms
(22 rows)

*Expected Behavior
Since these two queries are semantically equivalent, we were hoping that PostgreSQL
 would evaluate them in roughly the same amount of time. 

It looks to me that different order of group by clauses triggers different plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the covering index, it will trigger an index scan on associated columns;

however,  when the group by clauses have different order than the covering index (ps_suppkey, ps_partkey), the index scan will not be triggered.

Given that the user might not pay close attention to this subtle difference, I was wondering if it is worth making these two queries have the same and predictable performance on Postgresql.

*Test Environment
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04
 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)
The description of table partsupp is as follows:
tpch5=# \d partsupp;
                         Table "public.partsupp"
    Column     |          Type          | Collation | Nullable | Default 
---------------+------------------------+-----------+----------+---------
 ps_partkey    | integer                |           | not null | 
 ps_suppkey    | integer                |           | not null | 
 ps_availqty   | integer                |           | not null | 
 ps_supplycost | numeric(15,2)          |           | not null | 
 ps_comment    | character varying(199) |           | not null | 
Indexes:
    "partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey)
Foreign-key constraints:
    "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)
    "partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES part(p_partkey)
Referenced by:
    TABLE "lineitem" CONSTRAINT "lineitem_fk2" FOREIGN KEY (l_partkey, l_suppkey) REFERENCES
 partsupp(ps_partkey, ps_suppkey)

*Here are the steps for reproducing our observations:

Download the dataset from the link: https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing

Set up TPC-H benchmark

tar xzvf tpch5_postgresql.tar.gz

cd tpch5_postgresql

db=tpch5

createdb $db

psql -d $db < dss.ddl

for i in `ls *.tbl`

do

    echo $i

    name=`echo $i|cut -d'.' -f1`

    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';"

done

psql -d $db < dss_postgres.ri

Execute the queries