Re: DISTINCT vs GROUP BY - was Re: is (not) distinct from

George Neuner <gneuner2@xxxxxxxxxxx> · Sat, 04 Mar 2017 02:36:14 -0500

Hi David,

On Sat, 4 Mar 2017 02:32:48 +1300, David Rowley
<david.rowley@xxxxxxxxxxxxxxx> wrote:

>On 3 March 2017 at 18:26, George Neuner <gneuner2@xxxxxxxxxxx> wrote:
>> I know most people here don't pay much - or any - attention to
>> SQLServer, however there was an interesting article recently regarding
>> significant performance differences between DISTINCT and GROUP BY as
>> used to remove duplicates.
>>
>> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>>
>>
>> Now I'm wondering if something similar might be lurking in Postgresql?
>
>Yes things lurk there in PostgreSQL too. But to be honest I find the
>examples in the URL you included a bit strange. There's almost
>certainly going to be a table called "orders" that you'd use for the
>outer part of the query. In that case the orderid would already be
>unique.  To do the same in PostgreSQL you'd just use: select orderid,
>string_agg(description,'|') from orderitems group by orderid; assuming
>all orders had at least one line, you'd get the same result.

The author mentioned at the beginning that the simple queries:

  SELECT DISTINCT Description FROM Sales.OrderLines
  SELECT Description FROM Sales.OrderLines GROUP BY Description;

wouldn't display the subject behavior.

Of course, analyzing the much more complex queries is much more
difficult.  It begs the question: what actually is going on there?

But I don't use SQL Server ... my interest is in how Postgresql deals
with a similar situation.

>Assuming that parts.partcode is the PRIMARY KEY of parts, this query
>is legal in PostgreSQL. In some other databases, and I believe SQL
>Server might be one of them, you would have been forced to include
>part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
>you'd have done the same with that, internally the database would
>ignore the parts.description in the GROUP BY clause, as its smart
>enough to know that including parts.description in the clause is not
>going to change anything as the description is always the same for
>each parts.partcode, and no two records can share the same partcode.
>
>There's no such optimisation when it comes to DISTINCT. In PostgreSQL
>as of today DISTINCT is a bit naive, and will just uniquify the
>results on each column in the select clause.  Although quite possibly
>the same optimisation could apply to DISTINCT too, just nobody has
>thought to add it yet.
>
>In short, the main difference is going to be the fewer columns you're
>using to identify the groups the better. If you included all of the
>columns in the GROUP BY clause as you put in the select list with the
>DISTINCT query then in most cases the performance would be the same. I
>believe the only exception to this is in regards to parallel query, as
>currently only GROUP BYs may be parallelised, not DISTINCT.
>
>Historically with older now unsupported versions of PostgreSQL (pre
>8.4) you may have also preferred to use GROUP BY over DISTINCT as
>GROUP BY could be implemented internally by sorting or hashing the
>results, whereas DISTINCT used to only be implemented by Sorting the
>results. Although this has long since been the case.

I often have occasion to use multiple mapping relations: e.g., 
  A{1}->B{N}
  C{1}->B{N}
together in a query where C is provided and I need to find the
corresponding A(s).  Frequently these queries result in the same A
being found multiple times.

Although the mapping tuples are small [usually just a pair of keys],
the number of rows in the mapping tables may be very large, and a
given query may need to join/work its way through several such
mappings.

Typically in such situations, I divide the query using CTEs and (try
to) minimize the volume of data at each step by filtering duplicates
from any results that might include them.

I have always used DISTINCT to filter duplication, reserving GROUP BY
for aggregations (counting, etc.).  But if I understand correctly, you
are saying that GROUP BY should be preferred even for the simpler use.

George

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general