Thanks for the suggestion. This seems to work pretty well on 8.3, but not so well on 8.2. We were planning on upgrading to 8.3 soon anyway; we just have to move our schedule up a bit.

I think this type of algorithm would make sense in core. I suspect that with it in core, further optimizations could be done that pl/pgsql can't do.

--Rainer

From: pgsql-performance-owner@xxxxxxxxxxxxxx On Behalf Of Vladimir Sitnikov

You might get a great improvement for the '%' cases by using an index
on channel_name(<field>, start_time) and a little bit of pl/pgsql. Basically, you need to implement the following algorithm:

1) curr_<field> = ( select min(<field>) from ad_log )
2) record_exists = ( select 1 from ad_log where <field> = curr_<field> and _all_other_conditions limit 1 )
3) if record_exists == 1 then add curr_<field> to the results
4) curr_<field> = ( select min(<field>) from ad_log where <field> > curr_<field> )
5) if curr_<field> is not null then goto 2
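
For concreteness, here is a minimal pl/pgsql sketch of that loop. It assumes <field> is a channel_name column on ad_log with an index on (channel_name, start_time); the function name distinct_channels and the start_time range parameters (standing in for "_all_other_conditions") are just made-up names for illustration:

-- hypothetical sketch: emulate an "index distinct scan" with repeated
-- min() lookups, each of which can be answered from the index
create or replace function distinct_channels(p_from timestamp, p_to timestamp)
returns setof text as $$
declare
    curr text;
begin
    -- step 1: smallest value reachable through the index
    select min(channel_name) into curr from ad_log;
    while curr is not null loop
        -- steps 2 and 3: emit the value only if some row also matches
        -- the other conditions
        if exists (select 1 from ad_log
                   where channel_name = curr
                     and start_time between p_from and p_to
                   limit 1) then
            return next curr;
        end if;
        -- step 4: jump to the next distinct value via the index
        select min(channel_name) into curr
        from ad_log
        where channel_name > curr;
    end loop;
    -- step 5 is the loop condition above
    return;
end;
$$ language plpgsql stable;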
I believe it might make sense to implement this approach in the core (I would call it "index distinct scan"). That could dramatically improve "select distinct <column> from <table>" and "select <column> from <table> group by <column>" kinds of queries when there is an index on <column> and the column has a very small number of distinct values.

For instance: say a table has 10'000'000 rows, while the column of interest has only 20 distinct values. In that case, the database would be able to get each of those 20 values in virtually 20 index lookups.

What does the community think about that?
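
To spell the example out with the same hypothetical names as the sketch above: the plain queries below currently read the whole table, while the pl/pgsql emulation needs only a couple of index probes per distinct value, so roughly on the order of 20 lookups for 20 values.

-- the index the loop relies on (names are assumptions, not from the thread)
create index ad_log_channel_time_idx on ad_log (channel_name, start_time);

-- queries the proposed "index distinct scan" would speed up:
select distinct channel_name from ad_log;
select channel_name from ad_log group by channel_name;

-- the pl/pgsql emulation of it, restricted to an arbitrary date range:
select * from distinct_channels('2008-01-01', '2008-02-01');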