Re: select DISTINCT

Kevin Grittner <kgrittn@xxxxxxxxx> · Fri, 6 Sep 2013 17:51:06 -0700 (PDT)

pg noob <pgnube@xxxxxxxxx> wrote:

> The GROUP BY performs much better than DISTINCT even though both
> these two queries return the exact same count result.

No, GROUP BY performs much better than count(DISTINCT colname).

To confirm that this isn't something that has changed in the four
years since 8.4 was released, I used the latest source code.

test=# explain analyze select count(column1) from table1;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273571.41..273571.42 rows=1 width=4) (actual time=1704.584..1704.584 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.356..777.089 rows=16163937 loops=1)
 Total runtime: 1704.624 ms
(3 rows)

It is no surprise that just incrementing a counter for each row
scanned is faster than maintaining a collection of values seen so
far, looking up each value to see whether it is in the collection,
and inserting it if not, or sorting the list of values and counting
transitions.  Either method of getting a count of distinct values
is going to be a lot slower than just counting rows.

For your use of DISTINCT within the aggregate function, I removed
the unnecessary parentheses for clarity.  It's the same plan and
speed either way.

test=# explain analyze select count(distinct column1) from table1;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273571.41..273571.42 rows=1 width=4) (actual time=10553.480..10553.480 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.017..880.412 rows=16163937 loops=1)
 Total runtime: 10553.517 ms
(3 rows)

Note that the above is very different from using DISTINCT outside
of the aggregate function:

test=# explain analyze select count(*) from (select distinct column1 from table1) foo;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273627.68..273627.70 rows=1 width=0) (actual time=2637.148..2637.148 rows=1 loops=1)
   ->  HashAggregate  (cost=273571.41..273596.42 rows=2501 width=4) (actual time=2636.867..2637.034 rows=2501 loops=1)
         ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.027..802.792 rows=16163937 loops=1)
 Total runtime: 2637.213 ms
(4 rows)

Which is the same plan as the GROUP BY:

test=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273627.68..273627.70 rows=1 width=4) (actual time=2649.196..2649.196 rows=1 loops=1)
   ->  HashAggregate  (cost=273571.41..273596.42 rows=2501 width=4) (actual time=2648.903..2649.073 rows=2501 loops=1)
         ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.029..807.184 rows=16163937 loops=1)
 Total runtime: 2649.264 ms
(4 rows)

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general