On 2016-03-08 10:16:57 +0000, Geoff Winkless wrote: > On 7 March 2016 at 20:40, Peter J. Holzer <hjp-pgsql@xxxxxx> wrote: > > As Tom wrote, the estimate of having to read only about 140 rows is only > > valid if sc_id and sc_date are uncorrelated. In reality your query has > > to read a lot more than 140 rows, so it is much slower. > > But as I've said previously, even if I do select from scdate values > that I know to be in the first 1% of the data (supposedly the perfect > condition) the scan method is insignificantly quicker than the index > (scdate,scid) method. Actually the planner expects find a match within the first 0.0035 %, so to find out how fast that would be you would have to use a value from that range. > Even with the absolute perfect storm (loading in the entire index for > the full range) it's still not too bad (1.3 seconds or so). > > The point is that to assume, knowing nothing about the data, that the > data is in an even distribution is only a valid strategy if the worst > case (when that assumption turns out to be wildly incorrect) is not > catastrophic. That's not the case here. True. The fundamental problem here is that the planner doesn't have any notion of a worst case. It only knows "cost", and that is a single number for each operation. For many operations, both the best case and the worst case are unusable as cost - the first would almost always underestimate the time and choose a plan which is far from optimal and the second would almost always overestimate it and reject an optimal plan. The art of programming a planner (which I've dabbled with in a previous (not postgresql-related) project but certainly can't claim any expertise in) lies in choosing a cost function which is quite close most of the time and catastrophically wrong only very rarely. It is clear that PostgreSQL hasn't succeed in the latter category: Correlated columns do occur and the current cost function, which assumes that all columns are uncorrelated can catastrophically underestimate the cost in this case. The question is what can be done to improve the situation. Tom thinks that correlation statistics would help. That seems plausible to me. You claim that no statistics are needed. That may or may not be true: You haven't proposed an alternate method yet. I feel fairly certain that using the worst case (the cost for scanning the whole table) would be just as bad in and would cause inferior plans to be used in many instances. Maybe computing the cost as weighted average of the best, average and worst case (e.g. cost = cost_best*0.05 + cost_avg*0.90 + cost_worst*0.05) would penalize methods with a large spread between best and worst case enough - but that still leaves the problem of determining the weights and determining what the "average" is. So it's the same black magic as now, just the little more complicated (on the plus side, this would probably be a relatively simple patch). If we assume that we could revamp the planner completely, other possibilities come to mind: For example, since I think that the core problem is having a single number for the cost, the planner could instead compute a distribution (in the most simple case just best and worst case, but ideally many values). Then the planner could say something like: I have two plans A nd B and A is at most 20 % faster in almost all cases. But in the worst case, A is 1000 times slower. Being 20 % faster most of the time is nice but doesn't outweigh the risk of being 1000 times slower sometimes, so I'll use B anyway. Another possibility I've been considering for some time is feeding back the real execution times into the planner, but that sounds like a major research project. (Actually I think Oracle does something like this since version 12) hp -- _ | Peter J. Holzer | I want to forget all about both belts and |_|_) | | suspenders; instead, I want to buy pants | | | hjp@xxxxxx | that actually fit. __/ | http://www.hjp.at/ | -- http://noncombatant.org/
Attachment:
signature.asc
Description: Digital signature