Re: Simple join optimized badly?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



--- Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y".  You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).

I don't agree here.  I have "exactly Y" running millions of times daily. 
There's enough data that the statistics on specific values don't help all that
much, even at the maximum statistics collection level.  By "exactly Y" I mean
the form of the query is identical, and the query plan is identical, since only
the general statistics are being used for most executions of the query.  The
specific values vary, so caching is no help.

In summary, I have a need to run "exactly Y" with query plan "exactly X".
(detail in postscript)

> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.

Do you mean something like "The selectivity of these two columns together is
really X"?  That would solve my specific problem.  And the academic part of me
likes the elegance of that solution.

On the negative side, it means people must learn how the optimizer uses
statistics (which I would never have done if I could have said "Use index X").

> Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.

Yes it would be useful for more than one query.  I agree that it's the "right"
level to hint at, in that it is at a higher level.  Maybe the right level is
not the best level though?  In a business environment, you just want things to
work, you don't want to analyze a problem all the way through and find the
best, most general solution.  As a former academic I understand the two points
of view, and I don't think either is correct or wrong.  Each view has its
place.

Since I work for a business now, my focus is on making quick fixes that keep
the system running smoothly.  Solving problems in the "right" way is not
important.  If the query slows down again later, we will examine the query plan
and do whatever we have to do to fix it.  It's not elegant, but it gives fast
response times to the customers, and that's what matters.


PS The case in question is a table with a 3-column primary key on (A, B, C). 
It also has an index on (B, C).  Re-ordering the primary key doesn't help as I
do lookups on A only as well.  When I specify A, B and C (the primary key), the
optimizer chooses the (B, C) index, on the assumption that specifying these two
values will return only 1 row.  But high correlation between B and C leads to
100s of rows being returned, and the query gets very slow.  The quick fix is to
say "Use index (A, B, C)".  The statistics level fix would be to say "B and C
really have high correlation".


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux