Re: select distinct and index usage

"David Wilson" <david.t.wilson@xxxxxxxxx> · Mon, 7 Apr 2008 20:54:38 -0400

On Mon, Apr 7, 2008 at 7:57 PM, Gregory Stark <stark@xxxxxxxxxxxxxxxx> wrote:
>
>  You could try changing it to the equivalent GROUP BY query. The planner,
>  unfortunately, doesn't know they're equivalent and has two separate sets of
>  plans available. In this case where there are only 4,000 distinct values out
>  of 75M original records you might find a HashAggregate plan, which the planner
>  doesn't know can be used for DISTINCT, best. You might have to raise work_mem
>  before the planner feels a hash will fit.
>
>  --
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>   Ask me about EnterpriseDB's RemoteDBA services!
>

Progress!

explain analyze select datestamp from vals group by datestamp;
                                                             QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1719740.40..1719783.03 rows=4263 width=4)
(actual time=120192.018..120193.930 rows=4252 loops=1)
   ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472 width=4)
(actual time=17.441..66807.429 rows=75391476 loops=1)
 Total runtime: 120195.144 ms

Compared with:

explain analyze select distinct datestamp from vals;
                                                              QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=15003047.47..15380004.83 rows=4263 width=4) (actual
time=649599.159..721671.595 rows=4252 loops=1)
  ->  Sort  (cost=15003047.47..15191526.15 rows=75391472 width=4)
(actual time=649599.157..694392.602 rows=75391476 loops=1)
        Sort Key: datestamp
        Sort Method:  external merge  Disk: 1178592kB
        ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472
width=4) (actual time=9.104..93130.468 rows=75391476 loops=1)
 Total runtime: 722379.434 ms

Still doing the sequential scan on the table, but at least it's
avoiding the expensive disk merge sort. It still seems as if I ought
to be able to coax it into using an index for this type of query,
though- especially since it's using one on the other table. Is there
perhaps some way to reformulate the index in such a way as to make it
more useful to the planner?

-- 
- David T. Wilson
Princeton Satellite Systems
david.t.wilson@xxxxxxxxx