Re: 8.2.4 serious slowdown

Tom Lane <tgl@xxxxxxxxxxxxx> · Thu, 10 Jan 2008 19:00:29 -0500

Sim Zacks <sim@xxxxxxxxxxxxxx> writes:
> I just upgraded my database server from 8.0.1 to 8.2.4
> Most things went very well, but I have a couple of queries that really slowed down with the new server.

The core of the problem seems to be the rowcount misestimation here:

>      ->  Merge Left Join  (cost=5816.85..5948.10 rows=1245 width=82) (actual time=1169.837..1220.895 rows=3685 loops=1)
>            Merge Cond: (("outer".batchid = "inner".refid) AND ("outer".partid = "inner".partid))
>            Filter: (COALESCE("inner".commited, false) = false)

vs in 8.2

>      ->  Merge Left Join  (cost=3984.15..4111.60 rows=1 width=28) (actual time=319.642..348.285 rows=3685 loops=1)
>            Merge Cond: ((d.batchid = f.refid) AND (e.partid = f.partid))
>            Filter: (NOT COALESCE(f.commited, false))

The single-row estimate causes it to go for nestloops at all the
higher join levels, and when the actual result size is 3685 rows, of
course it takes 3685 times longer than the planner expected :-(

I assume that the original query is something along the lines of

	d left join f on (...) where coalesce(f.commited, false) = false

I traced through what would happen here, and found that:

* 8.2 changes the "boolvar = false" clause to "NOT boolvar", because it
wants to be able to recognize these equivalent forms as equivalent.
8.0 just leaves it as-is.

* 8.0 can't figure out anything about a COALESCE, so it uses the
fallback DEFAULT_EQ_SEL (0.005) selectivity estimate for what it sees
as an equality clause.  This is apparently close enough to be within a
factor of 3 of reality.

* 8.2 sees a NOT clause, which clause_selectivity() figures has a
selectivity of 1.0 minus the selectivity of the argument, which is
a COALESCE clause, which the recursive call to clause_selectivity()
doesn't know anything about and so punts ... returning 1.0 which is
its default for unknown clause types.  So we end up with a selectivity
estimate of exactly 0.0, pinning the estimated join size to the minimum
of 1 row.

The default 1.0 selectivity estimate seems fairly silly; in other cases
where the planner really has no idea about the expected value of a
boolean expression, we use 0.5.  On studying the CVS history, it looks
like I inserted that in revision 1.24 of clausesel.c, and I think that I
must have misinterpreted what was happening in the previous state of the
code:

static Cost
compute_selec(Query *root, List *clauses, List *or_selectivities)
{
    Cost        s1 = 0;
    List       *clause = lfirst(clauses);

    if (clause == NULL)
        s1 = 1.0;
    else if (IsA(clause, Param))
    {
        /* XXX How're we handling this before?? -ay */
        s1 = 1.0;
    }
    else if (IsA(clause, Const))
        s1 = ((bool) ((Const *) clause)->constvalue) ? 1.0 : 0.0;
    else if (IsA(clause, Var))
        ...

The reason for the first bit is that a NIL list of clauses means no
WHERE clause, so 1.0 selectivity is actually right in that case.
The 1.0 default for a Param is pretty silly (possibly AY made the same
mistake as me at some earlier point).  If you look further down in the
routine you see 0.5 being used in default cases, but if you don't study
the whole thing you might well think it was intending 1.0 as default.

So I think we should change the default to 0.5 for 8.3, but seeing that
it's been this way for 8 years (!) I'm a bit hesitant to back-patch.
Comments anyone?

In the meantime, Sim would probably have better luck if he restructured
this particular clause in some other way, say

	where f.commited is not true
or
	where f.commited = false or f.commited is null

Note also that he really ought to move up to 8.2.6, as 8.2.4 is not
very sane about what IS NULL means for a left join's result.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend