Re: Potential performance issues related to group by and covering index

David Rowley <dgrowleyml@xxxxxxxxx> · Tue, 2 Mar 2021 22:49:20 +1300

On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu <liuxy@xxxxxxxxxx> wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that different order of group by clauses triggers different plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the covering index, it will trigger an index scan on associated columns;
> however,  when the group by clauses have different order than the covering index (ps_suppkey, ps_partkey), the index scan will not be triggered.
> Given that the user might not pay close attention to this subtle difference, I was wondering if it is worth making these two queries have the same and predictable performance on Postgresql.

Unfortunately, it would take a pretty major overhaul of the query
planner to do that efficiently.

For now, have a few smarts involved in trying to make the GROUP BY
processing more efficient:

1) We remove columns from the GROUP BY if they're functionally
dependent on the primary key, providing the primary key is present
too. (you're seeing this in your example query)
2) We also change the order of the GROUP BY columns if it's a subset
of the ORDER BY columns.  This is quite good as we'd do grouping by
{b,a} if someone wrote GROUP BY a,b ORDER BY b,a; to which would save
having to re-sort the data for the ORDER BY after doing the GROUP BY.
That's especially useful for queries with a LIMIT clause.

If we want to do anything much smarter than that like trying every
combination of the GROUP BY clause, then plan times are likely going
to explode. The join order search is done based on the chosen query
pathkeys, which in many queries is the pathkeys for the GROUP BY
clause (see standard_qp_callback()). This means throughout the join
search, planner will try and form paths that provide pre-sorted input
that allows the group by to be implemented efficiently with pre-sorted
data. You might see Merge Joins rather than Hash Joins, for example.

If we want to try every combination of the GROUP BY columns then it
means repeating that join search once per combination. The join search
is often, *by far*, the most expensive part of planning a query.

While it would be nice if the planner did a better job on selecting
the best order for group by columns, unless we can come up with some
heuristics that allow us to just try a single combination that is
likely good, then I don't think anyone would thank us for slowing down
the planner by a factor of the number of possible combinations of the
group by columns.

David