Re: Weird behaviour in planner (PostgreSQL v 9.2.14)

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 11 Dec 2015 12:17:25 -0500

Maxim Boguk <maxim.boguk@xxxxxxxxx> writes:
> [ planner changes behavior when a VALUES RTE reaches 200 elements ]

The immediate cause of that is that, lacking any real statistics for the
VALUES RTE, eqjoinsel_semi() will fall back to a rather dubious default
estimate if it believes it's looking at a default estimate for the number
of distinct entries on either side of the join clause:

        /*
         * Without MCV lists for both sides, we can only use the heuristic
         * about nd1 vs nd2.
         */
        double        nullfrac1 = stats1 ? stats1->stanullfrac : 0.0;

        if (!isdefault1 && !isdefault2)
        {
            if (nd1 <= nd2 || nd2 < 0)
                selec = 1.0 - nullfrac1;
            else
                selec = (nd2 / nd1) * (1.0 - nullfrac1);
        }
        else
===>        selec = 0.5 * (1.0 - nullfrac1);

And get_variable_numdistinct() changes its mind about whether it's issuing
a default estimate when the VALUES size reaches DEFAULT_NUM_DISTINCT (200):

    /*
     * With no data, estimate ndistinct = ntuples if the table is small, else
     * use default.  We use DEFAULT_NUM_DISTINCT as the cutoff for "small" so
     * that the behavior isn't discontinuous.
     */
    if (ntuples < DEFAULT_NUM_DISTINCT)
        return clamp_row_est(ntuples);

    *isdefault = true;
    return DEFAULT_NUM_DISTINCT;

So basically, although this is alleged to be continuous behavior, the
changeover from isdefault = false to isdefault = true causes a huge
change in the result from eqjoinsel_semi.

There are a couple of things we might choose to do about this:

1. get_variable_numdistinct doesn't currently pay any attention to what
kind of RTE it's considering, but it does have access to that.  We could
teach it to assume that if the Var comes from a VALUES RTE, the values
are all distinct regardless of the length of the VALUES list.  That's
effectively what it assumes for all VALUES of < 200 elements today, and
it's not apparent why we shouldn't make the same assumption for longer
lists.  Or we could adopt some sort of nonlinear behavior that gradually
reduces the assumed stadistinct fraction, but I don't know any rule for
that that wouldn't be pretty ad-hoc.

2. We could change the default don't-know-anything selectivity estimate in
eqjoinsel_semi to be something less crude than 0.5.  But again, it's hard
to say what to use instead.

The first of these ideas might be something that would be sane to
back-patch, but I'd be pretty hesitant about back-patching anything
along the lines of #2; the scope of the effects is hard to predict.

			regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general