select count(distinct ...) is slower than select distinct in about 5x

jacket41142 <jacket41142@xxxxxxxxx> · Fri, 6 Dec 2013 10:10:26 +0800

Hi

I'm just trying about PostgreSQL, I create a database "test" with a table "t1":

test=> \d t1

                            Table "public.t1"
 Column  |  Type   |                      Modifiers                      
---------+---------+-----------------------------------------------------
 col_id  | integer | not null default nextval('t1_col_id_seq'::regclass)

 col_int | integer | 
Indexes:
    "t1_pkey" PRIMARY KEY, btree (col_id)
    "t1_col_int_idx" btree (col_int)

There are 10001000 rows in that table, and basically col_int of each row are filled with random data within range [0,1024].

One strange thing I found is that:

test=> select distinct col_int from t1;
Time: 1258.627 ms
test=> select distinct col_int from t1;
Time: 1264.667 ms
test=> select distinct col_int from t1;

Time: 1261.805 ms

If I use "group by":

test=> select distinct col_int from t1 group by col_int;
Time: 1180.617 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1179.849 ms

test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms

So the performance difference is not very large.
But when I do that:

test=> select count(distinct col_int) from t1;

 count 
-------
  1025
(1 row)

Time: 7367.476 ms
test=> select count(distinct col_int) from t1;
 count 
-------
  1025
(1 row)

Time: 6946.233 ms
test=> select count(distinct col_int) from t1;

 count 
-------
  1025
(1 row)

Time: 7386.969 ms
test=> select count(distinct col_int) from t1;
 count 
-------
  1025
(1 row)

The speed is straightly worse! But it's doesn't make sense. Since we can just make a temporary table or subquery and count for all rows. The performance should be almost the same.

So do you have any idea about this? Or maybe PostgreSQL's query planner can be improved for this kinds of query?

By the way, if I use a subquery to replace the above query, here's what I got:

test=> select count(*) from (select distinct col_int from t1) as tmp;
 count 
-------
  1025
(1 row)

Time: 1267.468 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;

 count 
-------
  1025
(1 row)

Time: 1257.327 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
 count 
-------
  1025
(1 row)

Time: 1258.189 ms

OK, this workaround works. But I just think if postgres can improve with this kind of query, it will be better. Also I'm not sure about other similar scenarios that will cause similar problem.

The following is the output of "explain analyze ...":

test=> explain analyze select distinct col_int from t1;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------

 HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39034.653..39037.239 rows=1025 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.041..19619.931 rows=10001000 loops=1)

 Total runtime: 39039.136 ms
(3 rows)

Time: 39103.622 ms
test=> explain analyze select distinct col_int from t1 group by col_int;
                                                          QUERY PLAN                                                           

-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169280.86..169291.11 rows=1025 width=4) (actual time=39062.417..39064.882 rows=1025 loops=1)

   ->  HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39058.136..39060.303 rows=1025 loops=1)
         ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.024..19439.482 rows=10001000 loops=1)

 Total runtime: 39066.896 ms
(4 rows)

Time: 39067.198 ms
test=> explain analyze select count(distinct col_int) from t1;
                                                       QUERY PLAN                                                        

-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169268.05..169268.06 rows=1 width=4) (actual time=45994.120..45994.123 rows=1 loops=1)

   ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.025..19599.950 rows=10001000 loops=1)
 Total runtime: 45994.154 ms
(3 rows)

Time: 45994.419 ms

test=> explain analyze select count(*) from (select distinct col_int from t1) as tmp;

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------

 Aggregate  (cost=169291.11..169291.12 rows=1 width=0) (actual time=39050.598..39050.600 rows=1 loops=1)
   ->  HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39046.814..39048.742 rows=1025 loops=1)

         ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.035..19616.631 rows=10001000 loops=1)
 Total runtime: 39050.634 ms
(4 rows)

Time: 39050.896 ms

P.S. I have already use "analyze verbose t1;" several times so the database should already be optimized. But I'm just new to PostgreSQL.

The environment I use is:
PostgreSQL 9.3.1 (postgresql-9.3 9.3.1-1.pgdg12.4+1) on local machine (actually a vbox VM, but when I try the test, I didn't run something very different or some heavy program for all queries. And the result seems consistent.)

Ubuntu 12.04 LTS
I followed the installation step in Quickstart section of http://wiki.postgresql.org/wiki/Apt

best regards,
jacket41142