>> This can se GUC-controllable. Like plan_safety=0..1 with low default value. >> This can influence costs of plans where cost changes dramatically with small >> table changes and/or statistics is uncertain. Also this can be used as >> direct "hint" for such dangerous queries by changing GUC for session/single >> query. > > ISTM if you add statistics miss and 'risk margin' to the things the > planner would have to consider while generating a plan, you are > greatly increasing the number of plan paths that would have to be > considered for any non trivial query. FWIW, these ideas are well established in the statistical community. Currently, we are essentially using "maximum likelihood estimators". We estimate a bunch of selectivities by choosing what is most likely, plug them in to our objective function, and then minimize based upon the plugged in values. In a wide variety of cases, MLE's can be shown to be "asymptotically" optimal. That is, as our sample distribution approaches the true distribution, the best we can possibly do is to use the MLE. This is pretty sensible - if we actually knew all of the selectivities then the results aren't really random anymore. However, they often perform very poorly with small sample sizes - particularly if the loss function is very sensitive to relatively small fluctuations in the parameter estimates ( which postgres certainly is - switching from a hash join to a nest-loop can be disastrous ). Using the estimate that minimizes the "worst-case" performance is precisely a minimax estimator. There, the goal is to minimize the risk function ( iow, plan cost ) under the worst possible conditions. Wikipedia has a pretty good treatment - just think "plan cost" whenever you see "risk". Another approach, that hasn't been suggested yet, is some Bayesian update method. There, rather than calculating a specific parameter value ( like ndistinct ), you try to store the entire distribution and choose the plan that minimizes cost averaged over all of the possible parameter values. Example: ( please excuse the unrealistic numbers ) For instance, rather than estimate the selectivity of the join ( relation1.col1 = relation2.col1 ) to be 0.01, we would say it is 0.1 w/ probability 0.2 and 0.001 with probability 0.8. So, here is how we would choose the plan now: cost( nestloop | selectivity = 0.01 ) = 1 cost( hashjoin | selectivity = 0.01 ) = 2 cost( mergejoin | selectivity = 0.01 ) = 50 Here would be the bayesian approach: cost( nestloop | selectivity = 0.001 ) = 0.1 cost( hashjoin | selectivity = 0.001 ) = 1 cost( mergejoin | selectivity = 0.001 ) = 50 cost( nestloop | selectivity = 0.1 ) = 10 cost( hashjoin | selectivity = 0.1 ) = 3 cost( mergejoin | selectivity = 0.1 ) = 50 So, the bayesian costs are: nestloop: 0.1*0.8 + 10*0.2 = 2.08 hashjoin: 1.0*0.8 + 3*0.2 = 1.4 nestloop: 50*0.8 + 50*0.2 = 50 so the hashjoin would be chosen. For completeness, the minimax costs would be: nestloop: max( 0.1, 10 ) hashjoin: max( 1, 3 ) nestloop: max( 50, 50 ) So, again, the hashjoin is chosen. I obviously have a bias towards the Bayesian approach, but it's not because I expect it to necessarily perform a whole lot better but, rather, it reduces to the other two approaches. If we want the current behavior, then simply store the MLE selectivity with probability 1. If we want the minimax estimate, choose the worst possible value. Or anything in between. Also, ( not that I have even close to the experience / expertise to make this claim - so take this with a grain of salt ) it seems that the code changes would be substantial but pretty straightforward and easy to localize. Rather than passing a selectivity, pass a pair of arrays with selectivities and probabilities. Initially, we could keep the current estimates ( every passed array would be of length one ) and then make changes as problems appear ( like Josh's ) I hope my little estimation procedure tutorial has been a little helpful, please feel free to contact me off list if you have questions/want references. Best, Nathan Boley -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance