Re: Planner enhancement suggestion.

"Jim C. Nasby" <jnasby@xxxxxxxxxxxxx> · Tue, 7 Mar 2006 12:37:15 -0600

On Tue, Mar 07, 2006 at 07:09:15PM +0100, PFC wrote:
> 
> >The problem is that you're now talking about doing 2 index scans instead
> >of just one and a sort.
> 
> 	It depends on what you call an index scan :
> 	a- Scanning just the index (no heap lookup) to create a bitmap

Sure, and then you scan the other index and read the heap at the same
time (b). Your plan requires doing both. The question is: in what cases
will it be faster to scan the extra index and build the bitmap vs. just
doing a sort.

> 	b- Scanning the index and hitting the heap in index order to 
> 	retrieve the  rows
> 
> 	(a) should be quite fast, because indexes generally use less space 
> 	than  the main table, and have good locality of reference. (b) is OK if the 
> table fits in memory, but if it has to seek on every row from the heap...

If the table fits in memory, who cares? A sort should be damn fast at
that point, because you're dealing with a small set of data.

> 	So, when doing :
> 	SELECT * FROM products WHERE category=C ORDER BY price LIMIT 20;
> 
> 	If the category contains few products, using the index on category 
> 	then  sorting is good.
> 	However, if the category contains many items, postgres is likely to 
> 	use  the index on price to avoid the sort. It needs to lose time fetching 

Have you actually seen this behavior? My experience is that you have to
have a correlation somewhere over 80-90% before an index scan is favored
over a seqscan + sort (which as I mentioned before appears to be
broken).

> 	The bitmap trick I proposed in my previous post would be even more  
> interesting if the table is clustered on category (which seems a  
> reasonable thing to do).

In which case it's highly unlikely that using the price index will buy
you anything.

> 	Does the cost estimator know about this kind of correlation ?

Yes. The problem is that the index scan cost estimator figures out a
best and worst case cost, and then interpolates between the two using
correlation^2. IMO it should be using abs(correlation) to do this, and
there's some data at http://stats.distributed.net/~decibel/ that backs
this up. There's also been some discussions on -hackers (search the
archives for "index cost correlation nasby"), but I've not had time to
follow up on this. If you wanted to test a new index cost formula it
would be a one line change to the code.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@xxxxxxxxxxxxx
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461