Re: Simple join optimized badly?

"Bucky Jordan" <bjordan@xxxxxxxxxx> · Wed, 11 Oct 2006 10:27:26 -0400

> Brian Herlihy <btherl@xxxxxxxxxxxx> writes:
> > What would it take for hints to be added to postgres?
> 
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
> 
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y".  You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.  Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
> 

Do I understand correctly that you're suggesting it might not be a bad
idea to allow users to provide statistics?

Is this along the lines of "I'm loading a big table and touching every
row of data, so I may as well collect some stats along the way" and "I
know my data contains these statistical properties, but the analyzer
wasn't able to figure that out (or maybe can't figure it out efficiently
enough)"?

While it seems like this would require more knowledge from the user
(e.g. more about their data, how the planner works, and how it uses
statistics) this would actually be helpful/required for those who really
care about performance. I guess it's the difference between a tool
advanced users can get long term benefit from, or a quick fix that will
probably come back to bite you. I've been pleased with Postgres'
thoughtful design; recently I've been doing some work with MySQL, and
can't say I feel the same way.

Also, I'm guessing this has already come up at some point, but what
about allowing PG to do some stat collection during queries? If you're
touching a lot of data (such as an import process) wouldn't it be more
efficient (and perhaps more accurate) to collect stats then, rather than
having to re-scan? It would be nice to be able to turn this on/off on a
per query basis, seeing as it could have pretty negative impacts on OLTP
performance...

- Bucky