On 1 October 2014 19:56, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
> On 09/30/2014 04:01 PM, Simon Riggs wrote:
>> On 30 September 2014 18:28, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
>>
>>>> Anyway, in the particular case I posted, fixing n_distinct to
>>>> realistic numbers (%) fixed the query plan.
>>>
>>> But wouldn't fixing the absolute number also have fixed the plan?
>>
>> There are two causes of this issue.
>>
>> 1. Poor estimates of n_distinct. Fixable by user.
>>
>> 2. Poor assumption of homogeneous distribution. No way for the user to
>> fix it, and insufficient stats detail to solve it in the current planner.
>>
>> I see (2) as the main source of issues, since, as we observe, (1) is fixable.
>
> I disagree that (1) is not worth fixing just because we've provided
> users with an API to override the stats. It would unquestionably be
> better for us to have a better n_distinct estimate in the first place.
> Further, this is an easier problem to solve, and fixing n_distinct
> estimates would fix a large minority of currently pathological queries.
> It's like saying "hey, we don't need to fix the leak in your radiator,
> we've given you a funnel in the dashboard you can pour water into."

Having read papers on it, I believe the problem is intractable. Coding
is not the issue. To anyone: please prove me wrong, in detail, with
references, so it can be coded.

> I do agree that (2) is worth fixing *as well*. As a first
> approximation, one possibility (as Tom suggests) would be to come up
> with a mathematical model for a selectivity estimate that sits somewhere
> *between* homogeneous distribution and the worst case. While that
> wouldn't solve a lot of cases, it would be a start towards having a
> better model.

This may have a reasonable solution, but I don't know it. A more
accurate mathematical model still sidesteps the main problem: it is a
guess, not certain knowledge, and the risk will still remain.

>>> I don't think correlation is up to the task as a complete solution,
>>> although it might help a little. There is no way a simple correlation
>>> can encode that John retired 15 years ago and hasn't logged on since,
>>> while Johannes was hired yesterday and never logged on before then.
>>
>> Ah, OK, essentially the same example.
>>
>> Which is why I ruled out correlation-stats-based approaches and
>> suggested a risk-weighted cost approach instead.
>
> By "risk-weighted" you mean just adjusting cost estimates based on what
> the worst-case cost looks like, correct? That seemed to be your
> proposal from an earlier post. If so, we're in violent agreement here.

I proposed a clear path for this earlier in the thread and have received
no comments on it as yet. Please take a look at that.

--
Simon Riggs                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
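
For concreteness, the per-column override that makes (1) "fixable by
user" is the n_distinct attribute option (PostgreSQL 9.0 and later). A
minimal sketch, using a purely hypothetical app_users table and
last_login column: a positive value is taken as an absolute count of
distinct values (Jeff's "absolute number"), while a negative value
between -1 and 0 is taken as a fraction of the table's row count (the
"%" form mentioned above):

    -- force the planner to assume distinct values = 5% of the row count
    ALTER TABLE app_users ALTER COLUMN last_login SET (n_distinct = -0.05);
    ANALYZE app_users;  -- the override only takes effect at the next ANALYZE

The planner then uses the supplied value in place of its own estimate
until the setting is removed again with
ALTER TABLE app_users ALTER COLUMN last_login RESET (n_distinct).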