On 3 March 2017 at 18:26, George Neuner <gneuner2@xxxxxxxxxxx> wrote: > I know most people here don't pay much - or any - attention to > SQLServer, however there was an interesting article recently regarding > significant performance differences between DISTINCT and GROUP BY as > used to remove duplicates. > > https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct > > > Now I'm wondering if something similar might be lurking in Postgresql? Yes things lurk there in PostgreSQL too. But to be honest I find the examples in the URL you included a bit strange. There's almost certainly going to be a table called "orders" that you'd use for the outer part of the query. In that case the orderid would already be unique. To do the same in PostgreSQL you'd just use: select orderid, string_agg(description,'|') from orderitems group by orderid; assuming all orders had at least one line, you'd get the same result. In more general terms, PostgreSQL will allow you to GROUP BY and non-aggregated columns which are functionally dependent on the GROUP BY clause, for example: SELECT parts.partcode,parts.description,sum(sales.quantity) from sales inner join parts on sales.partcode = parts.partcode GROUP BY parts.partcode; Assuming that parts.partcode is the PRIMARY KEY of parts, this query is legal in PostgreSQL. In some other databases, and I believe SQL Server might be one of them, you would have been forced to include part.description in the GROUP BY clause. Since PostgreSQL 9.6, if you'd have done the same with that, internally the database would ignore the parts.description in the GROUP BY clause, as its smart enough to know that including parts.description in the clause is not going to change anything as the description is always the same for each parts.partcode, and no two records can share the same partcode. There's no such optimisation when it comes to DISTINCT. In PostgreSQL as of today DISTINCT is a bit naive, and will just uniquify the results on each column in the select clause. Although quite possibly the same optimisation could apply to DISTINCT too, just nobody has thought to add it yet. In short, the main difference is going to be the fewer columns you're using to identify the groups the better. If you included all of the columns in the GROUP BY clause as you put in the select list with the DISTINCT query then in most cases the performance would be the same. I believe the only exception to this is in regards to parallel query, as currently only GROUP BYs may be parallelised, not DISTINCT. Historically with older now unsupported versions of PostgreSQL (pre 8.4) you may have also preferred to use GROUP BY over DISTINCT as GROUP BY could be implemented internally by sorting or hashing the results, whereas DISTINCT used to only be implemented by Sorting the results. Although this has long since been the case. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general