On Wed, Nov 10, 2010 at 10:15 AM, Kevin Grittner <Kevin.Grittner@xxxxxxxxxxxx> wrote:
> But wait -- it turns out that this pain was self-inflicted. Based
> on heavy testing of the interactive queries which users run against
> this database we tuned the database for "fully-cached" settings,
> with both random_page_cost and seq_page_cost at 0.1. In a
> practical sense, the users are almost always running these queries
> against very recent data which is, in fact, heavily cached -- so
> it's no surprise that the queries they run perform best with plans
> based on such costing. The problem is that these weekly maintenance
> runs need to pass the entire database, so caching effects are far
> less pronounced. If I set seq_page_cost = 1 and random_page_cost =
> 2 I get exactly the same (fast) plan as above.
>
> I guess the lesson here is not to use the same costing for
> database-wide off-hours maintenance queries as for ad hoc queries
> against a smaller set of recent data by users who expect quick
> response time. I'm fine with tweaking the costs in our maintenance
> scripts, but it does tend to make me daydream about how the
> optimizer might possibly auto-tweak such things....

Wow. That's fascinating, and if you don't mind, I might mention this
potential problem in a future talk at some point.

I've given some thought in the past to trying to maintain some model of
which parts of the database are likely to be cached, and trying to
adjust costing estimates based on that data. But it's a really hard
problem, because what is and is not in cache can change relatively
quickly, and you don't want to have too much plan instability. Also,
for many workloads, you'd need to have pretty fine-grained statistics
to figure out anything useful, which would be expensive and difficult
to maintain.

But thinking over what you've written here, I'm reminded of something
Peter said years ago, also about the optimizer. He was discussing the
ratio of the estimated cost to the actual cost and made an off-hand
remark that efforts had been made over the years to make that ratio
more consistent (i.e. improve the quality of the cost estimates) but
that they'd been abandoned because they didn't necessarily produce
better plans. Applying that line of thinking to this problem, maybe we
should give up on trying to make the estimates truly model reality, and
focus more on assigning them values which work well in practice. For
example, in your case, it would be sufficient to estimate the amount of
data that a given query is going to grovel through and then apply some
heuristic to choose values for random_page_cost and seq_page_cost based
on the ratio of that value to, I don't know, effective_cache_size.
Unfortunately, to know how much data we're going to grovel through, we
need to know the plan; and to decide on the right plan, we need to know
how much data we're going to grovel through.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
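
For the maintenance-script tweak Kevin describes, a per-session override
along these lines should be enough (just a sketch -- the cost values are
the ones quoted above, and the placeholder comment stands in for whatever
the weekly maintenance script actually runs):

    -- Use disk-realistic planner costs for this session only; the
    -- cluster-wide "fully-cached" settings (0.1/0.1) stay in effect
    -- for the interactive users.
    SET seq_page_cost = 1;
    SET random_page_cost = 2;

    -- ... run the database-wide maintenance queries here ...

    -- The overrides go away when the session ends, or can be undone
    -- explicitly:
    RESET seq_page_cost;
    RESET random_page_cost;

Using SET LOCAL inside a transaction is an alternative if the script
wants the overrides to expire at commit rather than at session end.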