Re: Index optimization ?

"Jim C. Nasby" <decibel@xxxxxxxxxxx> · Tue, 18 Jan 2005 22:11:16 -0600

On Tue, Jan 18, 2005 at 11:03:22PM -0300, Alvaro Herrera wrote:
> On Tue, Jan 18, 2005 at 07:33:51PM -0600, Jim C. Nasby wrote:
> > On Wed, Jan 19, 2005 at 02:15:42AM +0100, Florian G. Pflug wrote:
> > > You can, howevery, accelerate something like "where f in (1,2,3,4)". You 
> > > just scan the index 4 times, each time for a different value. Of course,
> > > if the number of values becomes larger and larger, there is a point 
> > > where it's more efficient to do a sequential scan _once_, instead of a 
> > > few tousand index scans (depends on the number of rows in the table).
> > > The postgres optimizer tries to estimate this, and will switch to an 
> > > seq-scan, if it would have to do too many index lookups.
> > 
> > Are PostgreSQL Btree indexes setup as a linked-list so you can scan
> > forwards and backwards in them?
> 
> Yes, they are.
> 
> > If so, is the IN processor smart enough to collapse ranges of values
> > into a single index scan
> 
> No, it isn't AFAIK.
> 
> > (ie, IN(1,2,3,4,8,9,10) would best be done as an index scan starting
> > at 1 and stoping at >4 and a second scan starting at 8 and stopping at
> > >10).
> 
> That assumes the optimizer knows that the domain can contain integer
> values ... seems a complex and infructuous analysis to do in general.
> 
> Maybe the optimizer could collapse that to a single index scan from 1 to
> 10 and then apply a filter to extract only the values actually mentioned.
> Not sure how workable that is.

It seems like it should just be a SMOC (simple matter of code), though I
suspect better index statistics might be needed first. Given good enough
statistics, it should be easy to ascertain how many index tuples you can
expect between two values (even if it's not an int column, in this
case). The optimizer should also be able to know what the cost is to
scan down the index structure, so it should then be easy to figure out
which plan is better.

Speaking of index statistics, has anyone looked at ways to determine how
many tuples exist between two different values? For example, Sybase used
to keep statistics in 256 buckets. It would define the range of values
that each bucket covered, and how many tuples in the table fell into
that bucket. It's a very elegant way to deal with tables that have
skewed numbers, as well as clusters of different numbers.
-- 
Jim C. Nasby, Database Consultant               decibel@xxxxxxxxxxx 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@xxxxxxxxxxxxxx)