Re: select distinct and index usage

"David Wilson" <david.t.wilson@xxxxxxxxx> · Tue, 8 Apr 2008 04:23:04 -0400

On Mon, Apr 7, 2008 at 9:11 PM, Stephen Denne
<Stephen.Denne@xxxxxxxxxxxxxx> wrote:
>
>  You may be able to make use of an index by rearranging your query to generate a series between your min & max values, testing whether each value is in your table.
>
>  You've got 4252 distinct values, but what is the range of max - min? Say it's 5000 values, you'd do 5000 lookups via an index, unless postgres thought that the number of index based lookups where going to be more expensive than reading the entire table.

Upon further investigation, the above works very well:

explain analyze select ds from (select generate_series((select
datestamp from vals order by datestamp asc limit 1), (select datestamp
from vals order by datestamp desc limit 1), 86400) as ds) series where
exists (select datestamp from vals where datestamp=ds);

             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan series  (cost=4.89..8.90 rows=1 width=4) (actual
time=0.080..25264.239 rows=4252 loops=1)
   Filter: (subplan)
   ->  Result  (cost=4.89..4.90 rows=1 width=0) (actual
time=0.051..7.491 rows=6163 loops=1)
         InitPlan
           ->  Limit  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.023..0.024 rows=1 loops=1)
                 ->  Index Scan using val_datestamp_idx on vals
(cost=0.00..184401441.14 rows=75391472 width=4) (actual
time=0.021..0.021 rows=1 loops=1)
           ->  Limit  (cost=0.00..2.45 rows=1 width=4) (actual
time=0.020..0.020 rows=1 loops=1)
                 ->  Index Scan Backward using val_datestamp_idx on
validations  (cost=0.00..184401441.14 rows=75391472 width=4) (actual
time=0.018..0.018 rows=1 loops=1)
   SubPlan
     ->  Index Scan using val_datestamp_idx on validations
(cost=0.00..70453.21 rows=17685 width=4) (actual time=4.096..4.096
rows=1 loops=6163)
           Index Cond: (datestamp = $0)
 Total runtime: 25267.033 ms
(12 rows)

The series generates all the possible datestamps + about 40% extra.

What's particularly interesting here to me is that it at least seems
like this validates my original assumption that if the planner could
be coaxed into using the index it would be faster- or am I missing
something? This query, at 25 seconds, was certainly much faster than
even the GROUP BY version that ran in 120 seconds.

As before, thanks for all of the information and ideas. Down from 722
seconds to 25 seconds is a hefty improvement.

-- 
- David T. Wilson
Princeton Satellite Systems
david.t.wilson@xxxxxxxxx