Hello,
We have 2 TPC-H queries which fetch the same tuples but have significant query execution time differences (4.3 times).
We are sharing a pair of TPC-H queries that exhibit this performance difference:
First query:
SELECT "ps_comment",
"ps_suppkey",
"ps_supplycost",
"ps_partkey",
"ps_availqty"
FROM "partsupp"
WHERE "ps_partkey" + 16 < 1
OR "ps_partkey" = 2
GROUP BY "ps_partkey",
"ps_suppkey",
"ps_availqty",
"ps_supplycost",
"ps_comment"
Second query:
SELECT "ps_comment",
"ps_suppkey",
"ps_supplycost",
"ps_partkey",
"ps_availqty"
FROM "partsupp"
WHERE "ps_partkey" + 16 < 1
OR "ps_partkey" = 2
GROUP BY "ps_comment",
"ps_suppkey",
"ps_supplycost",
"ps_partkey",
"ps_availqty"
* Actual Behavior
We executed both queries on the TPC-H benchmark of scale factor 5: the first query takes over 1.7 seconds, while the second query only takes 0.4 seconds.
We think the time difference results from different plans selected. Specifically, in the first (slow) query, the DBMS performs an index scan on table partsupp using the covering index (ps_partkey, ps_suppkey), while the second (fast) query performs a parallel scan on (ps_suppkey, ps_partkey).
* Query Execution Plan
First query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Group (cost=0.43..342188.58 rows=399262 width=144) (actual time=0.058..1737.659 rows=4 loops=1)
Group Key: ps_partkey, ps_suppkey
Buffers: shared hit=123005 read=98055
-> Index Scan using partsupp_pkey on partsupp (cost=0.43..335522.75 rows=1333167 width=144) (actual time=0.055..1737.651 rows=4 loops=1)
Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))
Rows Removed by Filter: 3999996
Buffers: shared hit=123005 read=98055
Planning Time: 0.926 ms
Execution Time: 1737.754 ms
(9 rows)
Second query:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Group (cost=250110.68..350438.93 rows=399262 width=144) (actual time=400.353..400.361 rows=4 loops=1)
Group Key: ps_suppkey, ps_partkey
Buffers: shared hit=5481 read=24093
-> Gather Merge (cost=250110.68..346446.31 rows=798524 width=144) (actual time=400.351..406.741 rows=4 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15151 read=72144
-> Group (cost=249110.66..253276.80 rows=399262 width=144) (actual time=395.882..395.883 rows=1 loops=3)
Group Key: ps_suppkey, ps_partkey
Buffers: shared hit=15151 read=72144
-> Sort (cost=249110.66..250499.37 rows=555486 width=144) (actual time=395.880..395.881 rows=1 loops=3)
Sort Key: ps_suppkey, ps_partkey
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
Buffers: shared hit=15151 read=72144
-> Parallel Seq Scan on partsupp (cost=0.00..116363.88 rows=555486 width=144) (actual time=395.518..395.615 rows=1 loops=3)
Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))
Rows Removed by Filter: 1333332
Buffers: shared hit=15065 read=72136
Planning Time: 0.360 ms
Execution Time: 406.880 ms
(22 rows)
*Expected Behavior
Since these two queries are semantically equivalent, we were hoping that PostgreSQL would evaluate them in roughly the same amount of time.
It looks to me that different order of group by clauses triggers different plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the covering index, it will trigger an index scan on associated columns;
however, when the group by clauses have different order than the covering index (ps_suppkey, ps_partkey), the index scan will not be triggered.
Given that the user might not pay close attention to this subtle difference, I was wondering if it is worth making these two queries have the same and predictable performance on Postgresql.
*Test Environment
Ubuntu 20.04 machine "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (with scale factor 5)
The description of table partsupp is as follows:
tpch5=# \d partsupp;
Table "public.partsupp"
Column | Type | Collation | Nullable | Default
---------------+------------------------+-----------+----------+---------
ps_partkey | integer | | not null |
ps_suppkey | integer | | not null |
ps_availqty | integer | | not null |
ps_supplycost | numeric(15,2) | | not null |
ps_comment | character varying(199) | | not null |
Indexes:
"partsupp_pkey" PRIMARY KEY, btree (ps_partkey, ps_suppkey)
Foreign-key constraints:
"partsupp_fk1" FOREIGN KEY (ps_suppkey) REFERENCES supplier(s_suppkey)
"partsupp_fk2" FOREIGN KEY (ps_partkey) REFERENCES part(p_partkey)
Referenced by:
TABLE "lineitem" CONSTRAINT "lineitem_fk2" FOREIGN KEY (l_partkey, l_suppkey) REFERENCES partsupp(ps_partkey, ps_suppkey)
*Here are the steps for reproducing our observations:
Download the dataset from the link: https://drive.google.com/file/d/13rFa1BNDi4e2RmXBn-yEQkcqt6lsBu1c/view?usp=sharing
Set up TPC-H benchmark
tar xzvf tpch5_postgresql.tar.gz
cd tpch5_postgresql
db=tpch5
createdb $db
psql -d $db < dss.ddl
for i in `ls *.tbl`
do
echo $i
name=`echo $i|cut -d'.' -f1`
psql -d $db -c "COPY $name FROM '`pwd`/$i' DELIMITER '|' ENCODING 'LATIN1';"
done
psql -d $db < dss_postgres.ri
Execute the queries