Re: select DISTINCT

pg noob <pgnube@xxxxxxxxx> · Mon, 9 Sep 2013 10:10:35 -0400

Thank you Kevin and Jeff for the responses.
These are very helpful.

On Fri, Sep 6, 2013 at 10:48 PM, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

On Friday, September 6, 2013, pg noob  wrote:

Hi all,

I'm curious about some of the query estimates that I'm seeing with queries that use DISTINCT.

I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some expensive work to

return DISTINCT rows.  This is contrary to what I was expecting because I expected that
if it knows that it is building a distinct result set that it would maintain
some kind of ordering as it built the results to be able to quickly throw

away duplicates without any real overhead to the query.

There is no cost-free way of doing that.   You either have to sort, or hash, or walk in index order, and none of those things are free.

...

But it gets a bit stranger than that.
According to the docs,

“DISTINCT ON column will eliminate all duplicates in the specified column; this is equivalent to using GROUP BY column.”

DISTINCT ON is not the same thing as DISTINCT, which is not the same as count(distinct ...)

Note that it says that using distinct is EQUIVALENT to using GROUP BY.

And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                          

------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=18596.606..18596.607 rows=1 loops=1)

   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;

Compare that to:

explain analyze select count(*) from (select DISTINCT column1 from table1) as foo;

Any ideas why this is?  Or what I am doing wrong?

The code that executes "count(distinct ....)" has never been taught how to use hash aggregation, whereas DISTINCT and GROUP BY have been.  It always sorts, which is often much slower than a hash, and also often much less memory efficient.   I speculate that this is because the implementation of count(distinct ...) is really ancient code and never been brought up to date, either about hashing or about more thorough EXPLAIN estimates--I have been meaning to dig into it but haven't gotten around to it yet.

You can get the increased performance by just spelling it in the more verbose way, but be aware that count (distinct ...) deals with NULL differently than select count(*) from (select DISTINCT...)  does, so they are not exactly identical.

Cheers,

Jeff