On Fri, Apr 23, 2010 at 3:22 PM, Cédric Villemain
<cedric.villemain.debian@xxxxxxxxx> wrote:
> 2010/4/23 Robert Haas <robertmhaas@xxxxxxxxx>:
>> On Fri, Apr 23, 2010 at 9:09 AM, Cédric Villemain
>> <cedric.villemain.debian@xxxxxxxxx> wrote:
>>> 2010/4/23 Robert Haas <robertmhaas@xxxxxxxxx>:
>>>> On Thu, Apr 22, 2010 at 10:37 PM, Vlad Arkhipov <arhipov@xxxxxxxxxxxx> wrote:
>>>>> I don't think this is just an issue with statistics, because the
>>>>> same problem arises when I try executing a query like this:
>>>>
>>>> I'm not sure how you think this proves that it isn't a problem with
>>>> statistics, but I think what you should be focusing on here, looking
>>>> back to your original email, is that the plans that are actually much
>>>> faster have almost as much estimated cost as the slower one.  Since
>>>> all your data is probably fully cached, at a first cut, I might try
>>>> setting random_page_cost and seq_page_cost to 0.005 or so, and
>>>> adjusting effective_cache_size to something appropriate.
>>>
>>> That will help correct the situation, but the planner is losing
>>> here, I think.
>>
>> Well, what do you think the planner should do differently?
>
> Here the planner just divides the number of rows in the t2 table by
> the number of distinct values of t1.t; this is the rows=20200 we can
> see in the explains.
> That seems normal, but it also looks to me like it can be improved.
> When estimating the row count as just num_rows/n_distinct, it *knows*
> that this is wrong, because the most_common_freqs of t2.t say that
> 99600 of the rows have the value 1, and fewer than 200 have any other
> value.
> So in every case the planner makes a (perhaps good) choice while
> being sure its estimates are wrong.
> I wonder if we can improve the planner here.
>
> In this case, where the number of rows is lower than the stats target
> (in t1.t), perhaps the planner can improve its decision by going a
> bit further and trying a plan for each of the n_distinct values of
> t2.t.
>
> I don't have a very clear idea of how to do that, but it may be
> better if the planner estimated whether its plan is 100% (or lower,
> just an idea) sure to happen; if so, that's fine, else it could try
> another plan.
>
> In this test case, if the query is:
> select *
> from t2
> join t1 on t1.t = t2.t
> where t1.id = X;
>
> if X=1 then the planner has a 20% chance that the rowcount=99600 and
> an 80% chance that the rowcount=200 or less; with a rowcount=20200,
> how can it find the good plan anyway?  Is it better to start with a
> bad estimate and perhaps find a good plan, or to start with an
> estimate which may be bad but leads to a good plan in more than XX%
> of the cases?
>
> So, currently, the planner does as expected, but can we try another
> approach for those corner cases?

Hmm.  We currently have a heuristic that we don't record a value as an
MCV unless it's more frequent than the average frequency.  When the
number of MCVs is substantially smaller than the number of distinct
values in the table, this is probably a good heuristic, since it
prevents us from bothering to record some values that are probably
only marginally more interesting than other values we don't have space
to record.  But if ndistinct is less than the stats target, we could
in theory record every value we find in the table as an MCV and leave
the histogram empty.  Not sure whether that would be better in general
or not, but it's a thought.
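
If anyone wants to poke at this, here is a rough sketch of how to look
at what ANALYZE has actually recorded for the join column (table and
column names are from the test case upthread; the statistics target of
1000 below is just an arbitrary value for illustration):

-- n_distinct vs. the length of most_common_vals shows whether every
-- distinct value already fits in the MCV list, and most_common_freqs
-- shows the skew being discussed (99600 rows with value 1).
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 't2' AND attname = 't';

-- Raising the per-column statistics target and re-analyzing makes it
-- more likely that all of the distinct values end up in the MCV list:
ALTER TABLE t2 ALTER COLUMN t SET STATISTICS 1000;
ANALYZE t2;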

...Robert