On 8/7/09 5:53 AM, "Robert Haas" <robertmhaas@xxxxxxxxx> wrote: > On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren<keesvandieren@xxxxxxxxx> > wrote: >> Would it get attention if I submit this to >> http://www.postgresql.org/support/submitbug ? (in fact it is not really a >> bug, but an improvement request). > > I think that many of the people who read that mailing list also read > this one, including, critically, Tom Lane, and you're not the first > person to run into a problem caused by lack of cross-column > statistics. Critically, it should be understood that this general problem is not just born from lack of cross-column statistics. It is also one of using the statistical expected value to calculate cost without consideration of the variance. Often, the cost of a plan varies widely and nonlinearly with a small change in the expected value the stats used to estimate cost. The problem is acute with LIMIT and various other boundary conditions where there is a steep 'cost cliff' for certain plan types. When LIMIT is applied, the planner changes its estimates, but does not take into account the _greatly_ increased uncertainty of those estimates. Imagine a case where the planner's estimate is 100% correct, and on average one side of a join will have 2 tuples. The planner chooses nested loops to do that join. But the distribution of the number of tuples at this node is skewed, so although the expected value is 2, a values of 10 and 0 are both common. When values of 10 occur, the execution time goes up significantly. An alternate plan might be slower for the case where the actual values in execution equal the expected values, but faster in the average case! The average cost of a plan is NOT that cost of the query with average statistics, due to variance, nonlinearity, and skew. And even if they were equal, it might not be preferable to choose the plan that is best on average over the one that is best at the 90th percentile. Getting back to the case above with the nestloop -- if the planner estimated the cost of the nestloop join versus other joins with some idea of the variance in the estimations it could favor more 'stable' execution plans. So in short, cross-column statistics don't have to be gathered and used to make this problem less acute. The planner just has to be more aware of variance and the things that lead to it, such as column correlation. Thus with a LIMIT applied, the expected value for the number of tuples scanned before a match will shrink, but the uncertainty of this estimate grows significantly as well, so the right plan to choose is one that hedges against the uncertainty, not the one that assumes the expected value is correct. Gathering and using cross column statistics will change the expected value for some plans, but more importantly will allow the uncertainty to be reduced. Better stats go hand in hand with the uncertainty analysis because one can never have all cross column statistics, across all tables and all join-function spaces, analyzed. Stats gathering can never be complete or without flaws. The planner can never be perfect. > I don't think you're going to get very far by submitting > this as a bug. There are already several people interested in this > problem, but as most of us don't get paid to hack on PostgreSQL, it's > a question of finding enough round tuits; this is not an easy thing to > fix. > > If you are sufficiently bothered by this problem that you are willing > to pay someone to fix it for you, there are several companies with > whom you can contract to get this feature developed and committed for > the next release of PostgreSQL. > > ...Robert > > -- > Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance