On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu <liuxy@xxxxxxxxxx> wrote: > *Expected Behavior > > Since these two queries are semantically equivalent, we were hoping that PostgreSQL would evaluate them in roughly the same amount of time. > It looks to me that different order of group by clauses triggers different plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the covering index, it will trigger an index scan on associated columns; > however, when the group by clauses have different order than the covering index (ps_suppkey, ps_partkey), the index scan will not be triggered. > Given that the user might not pay close attention to this subtle difference, I was wondering if it is worth making these two queries have the same and predictable performance on Postgresql. Unfortunately, it would take a pretty major overhaul of the query planner to do that efficiently. For now, have a few smarts involved in trying to make the GROUP BY processing more efficient: 1) We remove columns from the GROUP BY if they're functionally dependent on the primary key, providing the primary key is present too. (you're seeing this in your example query) 2) We also change the order of the GROUP BY columns if it's a subset of the ORDER BY columns. This is quite good as we'd do grouping by {b,a} if someone wrote GROUP BY a,b ORDER BY b,a; to which would save having to re-sort the data for the ORDER BY after doing the GROUP BY. That's especially useful for queries with a LIMIT clause. If we want to do anything much smarter than that like trying every combination of the GROUP BY clause, then plan times are likely going to explode. The join order search is done based on the chosen query pathkeys, which in many queries is the pathkeys for the GROUP BY clause (see standard_qp_callback()). This means throughout the join search, planner will try and form paths that provide pre-sorted input that allows the group by to be implemented efficiently with pre-sorted data. You might see Merge Joins rather than Hash Joins, for example. If we want to try every combination of the GROUP BY columns then it means repeating that join search once per combination. The join search is often, *by far*, the most expensive part of planning a query. While it would be nice if the planner did a better job on selecting the best order for group by columns, unless we can come up with some heuristics that allow us to just try a single combination that is likely good, then I don't think anyone would thank us for slowing down the planner by a factor of the number of possible combinations of the group by columns. David