Search Postgresql Archives

Re: DISTINCT vs GROUP BY - was Re: is (not) distinct from

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 03.03.2017 06:26, George Neuner wrote:
I know most people here don't pay much - or any - attention to
SQLServer, however there was an interesting article recently regarding
significant performance differences between DISTINCT and GROUP BY as
used to remove duplicates.

https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct

On a similar note, this is also an interesting read about the topic of distinct vs group by:

https://blogs.oracle.com/developer/entry/counting_with_oracle_is_faster

Interesting is the performance difference between integers and strings for PostgreSQL which doesn't exist for Oracle.

I also tried rewriting "select distinct" to "select group by" using PostgreSQL. It didn't help; it was even worse (see appendix).

I'll get around to doing some testing soon.  For now, I am just asking
if anyone has ever run into something like this?

Yes, my team did. We use Django on a daily basis to generate SQL queries. In case of model-spanning queries, a lot of joining and duplications are involved. Distinct is the "generally" accepted way to remedy the situation but it's actually more like Tom said: distinct is a band-aid here. UNIONS and SUBSELECTs would be better I guess.


Sven


** Appendix **


>>>># \d docs
                         Table "public.docs"
 Column |  Type   |                     Modifiers                    
--------+---------+---------------------------------------------------
 id     | integer | not null default nextval('docs_id_seq'::regclass)
 meta   | jsonb   |
Indexes:
    "docs_pkey" PRIMARY KEY, btree (id)


>>>># explain analyze select count(distinct meta->>'blood_group') from docs;
                                                         QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=760497.00..760497.01 rows=1 width=449) (actual time=37631.727..37631.727 rows=1 loops=1)
   ->  Seq Scan on docs  (cost=0.00..710497.00 rows=10000000 width=449) (actual time=0.500..3999.417 rows=10000000 loops=1)
 Planning time: 0.211 ms
 Execution time: 37631.829 ms
(4 rows)


>>>># explain analyze select count(*) from (select meta->>'blood_group' from docs group by meta->>'blood_group') as x;
                                                               QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4441923.83..4441923.84 rows=1 width=0) (actual time=41189.472..41189.472 rows=1 loops=1)
   ->  Group  (cost=4241923.83..4316923.83 rows=10000000 width=449) (actual time=31303.690..41189.455 rows=8 loops=1)
         Group Key: ((docs.meta ->> 'blood_group'::text))
         ->  Sort  (cost=4241923.83..4266923.83 rows=10000000 width=449) (actual time=31303.686..40475.227 rows=10000000 loops=1)
               Sort Key: ((docs.meta ->> 'blood_group'::text))
               Sort Method: external merge  Disk: 129328kB
               ->  Seq Scan on docs  (cost=0.00..735497.00 rows=10000000 width=449) (actual time=0.349..6433.691 rows=10000000 loops=1)
 Planning time: 2.189 ms
 Execution time: 41203.669 ms
(9 rows)


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux