Tom,

> You need to look into what the estimated vs actual rowcounts are for
> just the two-column condition (ie, where nid = something AND iid1 =
> something). Getting those estimates to match reality is the key to
> getting a sane plan choice here.

Many thanks - I agree in principle it isn't automatically the worst
decision; however, for this data set it unfortunately is :(

Improving the accuracy of the cost estimates is exactly what I'm trying
to achieve. So far I've tried:

- increasing the statistics target to the maximum setting with SET
  STATISTICS 1000 on columns rid, nid and iid1 of answers, then
  re-vacuuming
- adjusting random_page_cost downwards even further to try to make it
  prefer index IO
- increasing cpu_tuple_cost to 1.0 (from 0.01) to try to discourage it
  from processing data rows
- decreasing cpu_index_tuple_cost to 0.0005 (from 0.005) to encourage
  it to spend more time in index rows

None of the above had any effect :(

Interestingly, I didn't re-vacuum between changing the cost figures in
the config file (I did reload). The cost figures from EXPLAIN did
change, but in the queries below the 'wrong' index always comes out
just over a third more expensive.

Here's a query that matches actual data in the table; the planner sees
the cost of the two-column index as much lower, but the query takes an
order of magnitude longer to run (using nid & iid1 means filtering
through around 22,000 rows).

The three queries: the first is the one used in practice, against all
three columns.
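For completeness, the statistics change was done roughly like this (a
sketch from memory, on the answers table as above):

```sql
-- raise the per-column statistics target, then refresh the stats
ALTER TABLE answers ALTER COLUMN rid  SET STATISTICS 1000;
ALTER TABLE answers ALTER COLUMN nid  SET STATISTICS 1000;
ALTER TABLE answers ALTER COLUMN iid1 SET STATISTICS 1000;
VACUUM ANALYZE answers;

-- and the planner cost knobs tried in postgresql.conf (then reload):
--   random_page_cost     = (lowered further from its previous value)
--   cpu_tuple_cost       = 1.0      -- default 0.01
--   cpu_index_tuple_cost = 0.0005   -- default 0.005
```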
The next two queries use the two different combinations of just two
columns:

emystery=# explain analyse select * FROM answers
             WHERE rid=668332334 AND nid=253869913 AND iid1=535292129;
                                 QUERY PLAN
----------------------------------------------------------------------
 Index Scan using index_answers_nidiid1 on answers
     (cost=0.00..114.14 rows=1 width=60)
     (actual time=0.168..790.262 rows=1 loops=1)
   Index Cond: ((nid = 253869913) AND (iid1 = 535292129))
   Filter: (rid = 668332334)
 Total runtime: 790.305 ms
(4 rows)

emystery=# explain analyse select * FROM answers
             WHERE rid=668332334 AND nid=253869913;
                                 QUERY PLAN
----------------------------------------------------------------------
 Index Scan using index_answers_ridnidiid1 on answers
     (cost=0.00..152.49 rows=1 width=60)
     (actual time=0.120..0.141 rows=18 loops=1)
   Index Cond: ((rid = 668332334) AND (nid = 253869913))
 Total runtime: 0.207 ms
(3 rows)

emystery=# explain analyse select * FROM ecos_answers
             WHERE nid=253869913 AND iid1=535292129;
                                 QUERY PLAN
----------------------------------------------------------------------
 Index Scan using index_ecos_answers_nidiid1 on ecos_answers
     (cost=0.00..114.14 rows=1 width=60)
     (actual time=0.206..2398.645 rows=21891 loops=1)
   Index Cond: ((nid = 253869913) AND (iid1 = 535292129))
 Total runtime: 2424.769 ms
(3 rows)

What setting should I be tweaking to make the two-column index more
expensive? I thought I would need to either:

- inform the database that there are likely to be a lot of rows that
  need filtering (SET STATISTICS [maximum]???), or
- tell the database that filtering rows is expensive (cpu_tuple_cost++,
  random_page_cost-- ???)
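For reference, the per-column statistics the planner works from can be
inspected directly via the pg_stats view, which exposes the estimated
distinct counts and most common values for each analyzed column:

```sql
-- what the planner currently believes about these columns
SELECT attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'answers'
  AND attname IN ('rid', 'nid', 'iid1');
```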
...but I've already tried both of those. It would seem from the EXPLAIN
output that the first one is key: it shouldn't think (for the last
example) that it's only going to get 1 row back! But aside from
increasing SET STATISTICS, what other setting is there to inform it of
this?

The only remaining thing I can think of is to rebuild/replace the
preferred index; that should make it smaller, and *if* the index cost
estimates are partly based on the number of disk pages, that should
reduce the perceived cost of using it. Is this the case? Even if it
works, that still doesn't fix the underlying problem.

Any help appreciated,

Many thanks

Shane

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general