I have done another experiment for count(*) vs count(distinct ...), on same table schema but with 10000000 rows now. And for this time, the postgres version is 9.3.2 (9.3.2-1.pgdg12.4+1).
These two has same resulted query plan with same estimated cost, but count(*) is straightly fast.test=> explain analyze select count(*) from t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.92..169247.93 rows=1 width=0) (actual time=37775.187..37775.188 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..144247.94 rows=9999994 width=0) (actual time=0.037..19303.022 rows=10000000 loops=1)
Total runtime: 37775.216 ms
(3 筆資料列)
時間: 37775.493 ms
test=> explain analyze select count(distinct col_int) from t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.92..169247.93 rows=1 width=4) (actual time=45883.192..45883.195 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..144247.94 rows=9999994 width=4) (actual time=0.037..19652.540 rows=10000000 loops=1)
Total runtime: 45883.224 ms
(3 筆資料列)
時間: 45883.473 ms
test=> select count(*) from t1;
count
----------
10000000
(1 筆資料列)
時間: 602.018 ms
test=> select count(*) from t1;
count
----------
10000000
(1 筆資料列)
時間: 598.291 ms
test=> select count(*) from t1;
count
----------
10000000
(1 筆資料列)
時間: 592.439 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 筆資料列)
時間: 10311.788 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 筆資料列)
時間: 7063.156 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 筆資料列)
時間: 6899.283 ms
I don't think count(*) also uses sort since it should not be needed. But for the query planner, it seems it can not distinguish between these two now.
regards,
jacket41142
2013/12/11 jacket41142 <jacket41142@xxxxxxxxx>
Thanks very much.I think another problem is that the cost estimation isn't good enough to reflex real cost. Since we can see, from "explain analyze ...", count(distinct ...) has smallest cost between the others, but since it uses sorts, the time complexity should be higher especially for large amount of rows.
Also I think even if we can have multiple count() expressions, the optimizer should also be able to choose between use sort, HashAggregate or maybe something like linear aggregate if sorts are not needed or other methods if exist. Also this may be done as just one job for entire table of interested columns, or for each column separately.
regards,
jacket411422013/12/11 Jeff Janes <jeff.janes@xxxxxxxxx>
On Tue, Dec 10, 2013 at 9:28 AM, jacket41142 <jacket41142@xxxxxxxxx> wrote:
But when I do that:So the performance difference is not very large.test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)
Time: 7367.476 mscount(distinct ...) always sorts, rather than using a hash, to do its work. I don't think that there is any fundamental reason that it could not be changed to allow it to use hashing, it just hasn't been done yet. It is complicated by the fact that you can have multiple count() expressions in the same query which demand sorting/grouping on different columns.Cheers,Jeff