Re: DISTINCT vs GROUP BY - was Re: is (not) distinct from

David Rowley <david.rowley@xxxxxxxxxxxxxxx> · Sat, 4 Mar 2017 02:32:48 +1300

On 3 March 2017 at 18:26, George Neuner <gneuner2@xxxxxxxxxxx> wrote:
> I know most people here don't pay much - or any - attention to
> SQLServer, however there was an interesting article recently regarding
> significant performance differences between DISTINCT and GROUP BY as
> used to remove duplicates.
>
> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>
>
> Now I'm wondering if something similar might be lurking in Postgresql?

Yes things lurk there in PostgreSQL too. But to be honest I find the
examples in the URL you included a bit strange. There's almost
certainly going to be a table called "orders" that you'd use for the
outer part of the query. In that case the orderid would already be
unique.  To do the same in PostgreSQL you'd just use: select orderid,
string_agg(description,'|') from orderitems group by orderid; assuming
all orders had at least one line, you'd get the same result.

In more general terms, PostgreSQL will allow you to GROUP BY and
non-aggregated columns which are functionally dependent on the GROUP
BY clause, for example:

SELECT parts.partcode,parts.description,sum(sales.quantity) from sales
inner join parts on sales.partcode = parts.partcode GROUP BY
parts.partcode;

Assuming that parts.partcode is the PRIMARY KEY of parts, this query
is legal in PostgreSQL. In some other databases, and I believe SQL
Server might be one of them, you would have been forced to include
part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
you'd have done the same with that, internally the database would
ignore the parts.description in the GROUP BY clause, as its smart
enough to know that including parts.description in the clause is not
going to change anything as the description is always the same for
each parts.partcode, and no two records can share the same partcode.

There's no such optimisation when it comes to DISTINCT. In PostgreSQL
as of today DISTINCT is a bit naive, and will just uniquify the
results on each column in the select clause.  Although quite possibly
the same optimisation could apply to DISTINCT too, just nobody has
thought to add it yet.

In short, the main difference is going to be the fewer columns you're
using to identify the groups the better. If you included all of the
columns in the GROUP BY clause as you put in the select list with the
DISTINCT query then in most cases the performance would be the same. I
believe the only exception to this is in regards to parallel query, as
currently only GROUP BYs may be parallelised, not DISTINCT.

Historically with older now unsupported versions of PostgreSQL (pre
8.4) you may have also preferred to use GROUP BY over DISTINCT as
GROUP BY could be implemented internally by sorting or hashing the
results, whereas DISTINCT used to only be implemented by Sorting the
results. Although this has long since been the case.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general