Re: Question about (probably wrong) index scan cost for conditional indexes

Maxim Boguk <maxim.boguk@xxxxxxxxx> · Mon, 30 Jan 2012 12:24:46 +1100

On Mon, Jan 30, 2012 at 12:02 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Maxim Boguk <maxim.boguk@xxxxxxxxx> writes:
>> Seems previous test case not clear demonstrate the problem which i have
>> stuck with.
>> Now much better and close to reality test case:
>
> AFAICT, these behaviors all boil down to the fact that contrib/intarray
> doesn't provide a real cost estimator for its && operator.  It's using
> the "contsel" stub function, which provides a fixed selectivity of
> 0.001.  In your test case, with 1000000 rows in the table, the estimate
> for the number of rows satisfying "sections && '{2}'" thus comes out to
> exactly 1000.  Unfortunately, the true number is around 100000, and it's
> that discrepancy that is leading to all of these bad cost estimates.
>
> What I'd like to see done about this is for somebody to adapt the
> work Jan Urbanski did on tsvector stats collection and estimation
> so that it works for the anyarray operators.  It's a bit too late
> to imagine that that'll get done for 9.2, but maybe for 9.3.
>
>                        regards, tom lane

Hi,

Thank you very much for the answer.
I know there is issue with statistics over intarrays (it was there
very long time and sometime it's complicating things a lot).

However,  the 100x cost difference between:
SELECT * from test order by id limit 100;  (over "primary key (id)" btree index)
Limit  (cost=0.00..3.43 rows=100 width=37)
vs
SELECT * from test where sections && '{2}' order by value limit 100;
(over "test_value_in2section_key on test(value) where sections &&
'{2}'"   btree index)
Limit  (cost=0.00..539.29 rows=100 width=37)
seems wrong for me.

Both queries performs the absolutely same task: fetch 100 entries from
the table based on the ideally suitable index (no post
processing/filtering were done at all... just return 100 sorted tuples
based on single index scan).

I don't understand where 2x+ order of cost difference come from.

And even if I drop the intarray index completely, than I still have a
wrong plan (bitmap scan + sort),  because planner cost for the index
scan over conditional index 100 more the it should be.
(e.g. there is still an issue even in absence of the intarray index).

Is absence of frequency statistics over intarrays somehow linked to
the wrong planner cost estimates for conditional index scan?

King Regards,
Maxim Boguk

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general