Re: Potential performance issues related to group by and covering index

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





út 2. 3. 2021 v 9:53 odesílatel Liu, Xinyu <liuxy@xxxxxxxxxx> napsal:

Hello,


We have 2 TPC-H queries which fetch the same tuples but have significant query execution time differences (4.3 times).


We are sharing a pair of TPC-H queries that exhibit this performance difference:


First query:

SELECT "ps_comment", 

       "ps_suppkey", 

       "ps_supplycost", 

       "ps_partkey", 

       "ps_availqty" 

FROM   "partsupp" 

WHERE  "ps_partkey" + 16 < 1 

        OR "ps_partkey" = 2 

GROUP  BY "ps_partkey", 

          "ps_suppkey", 

          "ps_availqty", 

          "ps_supplycost", 

          "ps_comment" 


Second query:

SELECT "ps_comment", 

       "ps_suppkey", 

       "ps_supplycost", 

       "ps_partkey", 

       "ps_availqty" 

FROM   "partsupp" 

WHERE  "ps_partkey" + 16 < 1 

        OR "ps_partkey" = 2 

GROUP  BY "ps_comment", 

          "ps_suppkey", 

          "ps_supplycost", 

          "ps_partkey", 

          "ps_availqty" 


* Actual Behavior

We executed both queries on the TPC-H benchmark of scale factor 5: the first query takes over 1.7 seconds, while the second query only takes 0.4 seconds.
We think the time difference results from different plans selected. Specifically, in the first (slow) query, the DBMS performs an index scan on table partsupp using the covering index (ps_partkey, ps_suppkey), while the second (fast) query performs a parallel scan on (ps_suppkey, ps_partkey).


*  Query Execution Plan

  • First query:

                                                       QUERY PLAN                                                                  

----------------------------------------------------------------------------------------------------------------------------------------------

 Group  (cost=0.43..342188.58 rows=399262 width=144) (actual time=0.058..1737.659 rows=4 loops=1)

   Group Key: ps_partkey, ps_suppkey

   Buffers: shared hit=123005 read=98055

   ->  Index Scan using partsupp_pkey on partsupp  (cost=0.43..335522.75 rows=1333167 width=144) (actual time=0.055..1737.651 rows=4 loops=1)

         Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))

         Rows Removed by Filter: 3999996

         Buffers: shared hit=123005 read=98055

 Planning Time: 0.926 ms

 Execution Time: 1737.754 ms

(9 rows)



In this case there is brutal overestimation. Probably due  planner unfriendly written predicate ps_partkey + 16  < 1) OR ps_parkey = 2. You can try to rewrite this predicate to ps_parthkey < -15 OR ps_parkey = 2

Regards

Pavel





  • Second query:

                                                                                                                                          QUERY PLAN                                                                     

---------------------------------------------------------------------------------------------------------------------------------------------------

 Group  (cost=250110.68..350438.93 rows=399262 width=144) (actual time=400.353..400.361 rows=4 loops=1)

   Group Key: ps_suppkey, ps_partkey

   Buffers: shared hit=5481 read=24093

   ->  Gather Merge  (cost=250110.68..346446.31 rows=798524 width=144) (actual time=400.351..406.741 rows=4 loops=1)

         Workers Planned: 2

         Workers Launched: 2

         Buffers: shared hit=15151 read=72144

         ->  Group  (cost=249110.66..253276.80 rows=399262 width=144) (actual time=395.882..395.883 rows=1 loops=3)

               Group Key: ps_suppkey, ps_partkey

               Buffers: shared hit=15151 read=72144

               ->  Sort  (cost=249110.66..250499.37 rows=555486 width=144) (actual time=395.880..395.881 rows=1 loops=3)

                     Sort Key: ps_suppkey, ps_partkey

                     Sort Method: quicksort  Memory: 25kB

                     Worker 0:  Sort Method: quicksort  Memory: 25kB

                     Worker 1:  Sort Method: quicksort  Memory: 25kB

                     Buffers: shared hit=15151 read=72144

                     ->  Parallel Seq Scan on partsupp  (cost=0.00..116363.88 rows=555486 width=144) (actual time=395.518..395.615 rows=1 loops=3)

                           Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))

                           Rows Removed by Filter: 1333332

                           Buffers: shared hit=15065 read=72136

 Planning Time: 0.360 ms

 Execution Time: 406.880 ms

(22 rows)







*Expected Behavior

Since these two queries are semantically equivalent, we were hoping that PostgreSQL would evaluate them in roughly the same amount of time.
It looks to me that different order of group by clauses triggers different plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the covering index, it will trigger an index scan on associated columns;
however,  when the group by clauses have different order than the covering index (ps_suppkey, ps_partkey), the index scan will not be triggered.
Given that the user might not pay close attention to this subtle difference, I was wondering if it is worth making these two queries have the same and predictable performance on Postgresql.


*Test Environment

Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"

PostgreSQL v12.3

Database: TPC-H benchmark (with scale factor 5)

The description of table partsupp is as follows:

tpch5=# \d partsupp;

                         Table "public.partsupp"

    Column     |          Type          | Collation | Nullable | Default 

---------------+------------------------+-----------+----------+---------

 ps_partkey    | integer                |           | not null | 

 ps_suppkey    | integer                |           | not null | 

 ps_availqty   | integer                |           | not null | 

 ps_supplycost | numeric(15,2)          |           | not null | 

 ps_comment    | character varying(199) |           | not null | 

Indexes:

    "partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey)

Foreign-key constraints:

    "partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)

    "partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES part(p_partkey)

Referenced by:

    TABLE "lineitem" CONSTRAINT "lineitem_fk2" FOREIGN KEY (l_partkey, l_suppkey) REFERENCES partsupp(ps_partkey, ps_suppkey)






*Here are the steps for reproducing our observations:

  1. Download the dataset from the link: https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing

  2. Set up TPC-H benchmark

tar xzvf tpch5_postgresql.tar.gz

cd tpch5_postgresql

db=tpch5

createdb $db

psql -d $db < dss.ddl

for i in `ls *.tbl`

do

    echo $i

    name=`echo $i|cut -d'.' -f1`

    psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';"

done

psql -d $db < dss_postgres.ri

  1. Execute the queries



[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux