Hi David, On Sat, 4 Mar 2017 02:32:48 +1300, David Rowley <david.rowley@xxxxxxxxxxxxxxx> wrote: >On 3 March 2017 at 18:26, George Neuner <gneuner2@xxxxxxxxxxx> wrote: >> I know most people here don't pay much - or any - attention to >> SQLServer, however there was an interesting article recently regarding >> significant performance differences between DISTINCT and GROUP BY as >> used to remove duplicates. >> >> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct >> >> >> Now I'm wondering if something similar might be lurking in Postgresql? > >Yes things lurk there in PostgreSQL too. But to be honest I find the >examples in the URL you included a bit strange. There's almost >certainly going to be a table called "orders" that you'd use for the >outer part of the query. In that case the orderid would already be >unique. To do the same in PostgreSQL you'd just use: select orderid, >string_agg(description,'|') from orderitems group by orderid; assuming >all orders had at least one line, you'd get the same result. The author mentioned at the beginning that the simple queries: SELECT DISTINCT Description FROM Sales.OrderLines SELECT Description FROM Sales.OrderLines GROUP BY Description; wouldn't display the subject behavior. Of course, analyzing the much more complex queries is much more difficult. It begs the question: what actually is going on there? But I don't use SQL Server ... my interest is in how Postgresql deals with a similar situation. >Assuming that parts.partcode is the PRIMARY KEY of parts, this query >is legal in PostgreSQL. In some other databases, and I believe SQL >Server might be one of them, you would have been forced to include >part.description in the GROUP BY clause. Since PostgreSQL 9.6, if >you'd have done the same with that, internally the database would >ignore the parts.description in the GROUP BY clause, as its smart >enough to know that including parts.description in the clause is not >going to change anything as the description is always the same for >each parts.partcode, and no two records can share the same partcode. > >There's no such optimisation when it comes to DISTINCT. In PostgreSQL >as of today DISTINCT is a bit naive, and will just uniquify the >results on each column in the select clause. Although quite possibly >the same optimisation could apply to DISTINCT too, just nobody has >thought to add it yet. > >In short, the main difference is going to be the fewer columns you're >using to identify the groups the better. If you included all of the >columns in the GROUP BY clause as you put in the select list with the >DISTINCT query then in most cases the performance would be the same. I >believe the only exception to this is in regards to parallel query, as >currently only GROUP BYs may be parallelised, not DISTINCT. > >Historically with older now unsupported versions of PostgreSQL (pre >8.4) you may have also preferred to use GROUP BY over DISTINCT as >GROUP BY could be implemented internally by sorting or hashing the >results, whereas DISTINCT used to only be implemented by Sorting the >results. Although this has long since been the case. I often have occasion to use multiple mapping relations: e.g., A{1}->B{N} C{1}->B{N} together in a query where C is provided and I need to find the corresponding A(s). Frequently these queries result in the same A being found multiple times. Although the mapping tuples are small [usually just a pair of keys], the number of rows in the mapping tables may be very large, and a given query may need to join/work its way through several such mappings. Typically in such situations, I divide the query using CTEs and (try to) minimize the volume of data at each step by filtering duplicates from any results that might include them. I have always used DISTINCT to filter duplication, reserving GROUP BY for aggregations (counting, etc.). But if I understand correctly, you are saying that GROUP BY should be preferred even for the simpler use. George -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general