I wrote:
> "Carson, Leonard" <lcarson@xxxxxxxx> writes:
>> Here are the 3 views and some timing notes:
>> http://pgsql.privatepaste.com/decae31693#

> That doesn't really leave us any wiser than before, unfortunately.
> It's clear that the root of the problem is the drastic underestimation of
> the size of the rq/a join, but it's not clear why that's happening, nor
> why 8.4 would not have fallen into the same trap.

Leonard was kind enough to provide the problematic data off-list, and
here's what I find after some poking around: 8.4 is not, in fact, any
smarter than the more recent versions; it just happens to get lucky on
this particular query.

The core of the problem is this aspect of the projv view:

    SELECT ...
    FROM allocations a, ... other relations ...
    WHERE a.initial_start_date =
          (SELECT max(allocations.initial_start_date)
           FROM allocations
           WHERE allocations.account_id = a.account_id
             AND allocations.resource_id = a.resource_id)
      AND ... a bunch of other conditions ...

(There's a similar consider-only-the-latest-row construction for
accounts_history, which doubles the problem, but let's just look at this
one for now.)

Now there are two things that are bad about this.  First, the construct
requires executing the MAX-subselect for every row of "a", which is
expensive, and the planner knows it.  Second, it's very hard for the
planner to guess right about the selectivity of this condition on
a.initial_start_date.  It ends up using DEFAULT_EQ_SEL, which is 0.005,
but given Leonard's data set the actual selectivity is just about 1.0,
i.e., there are no records that aren't the latest for their account_id
cum resource_id, and thus no rows are eliminated by the condition anyway.
So we have an expensive scan on "a" that is going to produce many more
rows than the planner thinks.

By the time we get done joining to accounts_history, which has a similar
problem, the planner is estimating only one row out of the join (vs.
nearly 24000 in reality), and it's setting the total cost estimate at
148163 cost units.  This just totally bollixes planning of the joins to
the remaining half-dozen tables.  The single-row estimate is nearly fatal
in itself, because it encourages nestloop joining, which is pretty
inappropriate here.  But the other problem is that the planner considers
less-than-1% differences in cost estimates to be "in the noise", which
means that it's not going to consider cost differences of less than 1480
units in the remaining join steps to be significant.  This is how we end
up with the apparently brain-dead decisions to use seqscans on some of
the other tables such as "pi" and "ac": comparing the seqscan to a
potential inner indexscan, the total cost of the join is "the same"
according to the 1% rule, and then the first tiebreaker is startup cost,
and the indexscan has a higher startup cost.

Now, 8.4 also had the 1% rule, but it had slightly different tiebreaking
procedures, which caused it to end up picking the inner indexscans over
the seqscans; and in this particular data set inner indexscans do far
better than repeated seqscans when the rq/a/ah join turns out to produce
24000 times more tuples than predicted.  But I can't persuade myself that
the tiebreaking changes amount to a bug.  (I did experiment with varying
the tiebreaking rules a bit, but I think that would just be moving the
pain around.)
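As a quick cross-check of the selectivity claim above (that essentially
every allocations row is the latest one for its account_id/resource_id
pair), something along these lines can be run against the data.  This is
only a sketch over the acct.allocations table referenced in the view; it
assumes nothing beyond the account_id, resource_id, and initial_start_date
columns already shown, and needs only 8.4-era window aggregates:

    -- Sketch: count how many allocations rows are the latest for their
    -- (account_id, resource_id) pair.  If latest_rows comes out essentially
    -- equal to total_rows, the equality condition filters out almost nothing,
    -- i.e. its real selectivity is ~1.0 rather than the assumed 0.005.
    SELECT count(*) AS total_rows,
           sum(CASE WHEN initial_start_date = latest_start
                    THEN 1 ELSE 0 END) AS latest_rows
    FROM (SELECT initial_start_date,
                 max(initial_start_date)
                   OVER (PARTITION BY account_id, resource_id) AS latest_start
          FROM acct.allocations) s;

On Leonard's data the two counts come out the same, which means the
planner's 0.005 guess is off by a factor of about 200 before the
accounts_history side of the view even enters the picture.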
Long-term planner fixes for this type of problem might include improving
the statistics enough that we could get better rowcount estimates.
(Cross-column stats would help, since a contributing factor is that some
of the joins are on two join columns that are pretty heavily correlated.)
Another thing we've discussed is making risk estimates, whereby we could
realize that the nestloop-plus-seqscan plans are going to be a lot worse
if our rowcount estimates are off at all.  But both of those things are
research projects.

What seems like a potential near-term fix for Leonard is to recast his
views to do the latest-row selection more intelligently.  I experimented
with redoing the projv view like this to eliminate the
subselects-in-WHERE:

    SELECT ...
    FROM acct.requests rq,
         acct.fields_of_science fos,
         acct.accounts ac,
         acct.allocations a,
         (SELECT account_id, resource_id,
                 max(initial_start_date) AS initial_start_date
          FROM acct.allocations
          GROUP BY 1, 2) a_latest,
         acct.transaction_types tt,
         acct.resources ar,
         acct.accounts_history ah,
         (SELECT account_id, resource_id,
                 max(activity_time) AS activity_time
          FROM acct.accounts_history
          GROUP BY 1, 2) ah_latest,
         acct.allocation_states sx,
         acct.principal_investigators pi,
         acct.people p
    WHERE a.account_id = ac.account_id
      AND a.account_id = a_latest.account_id
      AND a.resource_id = a_latest.resource_id
      AND a.initial_start_date = a_latest.initial_start_date
      AND rq.account_id = a.account_id
      AND rq.start_date = a.initial_start_date
      AND ar.resource_id = a.resource_id
      AND a.allocation_type_id = tt.transaction_type_id
      AND ah.account_id = a.account_id
      AND ah.resource_id = a.resource_id
      AND ah.account_id = ah_latest.account_id
      AND ah.resource_id = ah_latest.resource_id
      AND ah.activity_time = ah_latest.activity_time
      AND sx.state_id = ah.state_id
      AND rq.primary_fos_id = fos.field_of_science_id
      AND pi.request_id = rq.request_id
      AND p.person_id = pi.person_id;

That produces significantly better plans.  It doesn't look like the
rowcount estimates are better :-( ... but the total estimated cost is now
down in the range of 3000 or so cost units, which means that the 1% rule
doesn't keep us from adopting the inner indexscans.  And this is
fundamentally a better way to do latest-row selection, anyhow.

(I guess another potential research project is to do this sort of
aggregated-subselect transformation automatically.  But don't hold your
breath for that to happen, either.)

			regards, tom lane

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance