pg noob <pgnube@xxxxxxxxx> wrote: > The GROUP BY performs much better than DISTINCT even though both > these two queries return the exact same count result. No, GROUP BY performs much better than count(DISTINCT colname). To confirm that this isn't something that has changed in the four years since 8.4 was released, I used the latest source code. test=# explain analyze select count(column1) from table1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=273571.41..273571.42 rows=1 width=4) (actual time=1704.584..1704.584 rows=1 loops=1) -> Seq Scan on table1 (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.356..777.089 rows=16163937 loops=1) Total runtime: 1704.624 ms (3 rows) It is no surprise that just incrementing a counter for each row scanned is faster than maintaining a collection of values seen so far, looking up each value to see whether it is in the collection, and inserting it if not, or sorting the list of values and counting transitions. Either method of getting a count of distinct values is going to be a lot slower than just counting rows. For your use of DISTINCT within the aggregate function, I removed the unnecessary parentheses for clarity. It's the same plan and speed either way. test=# explain analyze select count(distinct column1) from table1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=273571.41..273571.42 rows=1 width=4) (actual time=10553.480..10553.480 rows=1 loops=1) -> Seq Scan on table1 (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.017..880.412 rows=16163937 loops=1) Total runtime: 10553.517 ms (3 rows) Note that the above is very different from using DISTINCT outside of the aggregate function: test=# explain analyze select count(*) from (select distinct column1 from table1) foo; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=273627.68..273627.70 rows=1 width=0) (actual time=2637.148..2637.148 rows=1 loops=1) -> HashAggregate (cost=273571.41..273596.42 rows=2501 width=4) (actual time=2636.867..2637.034 rows=2501 loops=1) -> Seq Scan on table1 (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.027..802.792 rows=16163937 loops=1) Total runtime: 2637.213 ms (4 rows) Which is the same plan as the GROUP BY: test=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=273627.68..273627.70 rows=1 width=4) (actual time=2649.196..2649.196 rows=1 loops=1) -> HashAggregate (cost=273571.41..273596.42 rows=2501 width=4) (actual time=2648.903..2649.073 rows=2501 loops=1) -> Seq Scan on table1 (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.029..807.184 rows=16163937 loops=1) Total runtime: 2649.264 ms (4 rows) -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general