On 2019-09-12 12:54:55 -0400, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@xxxxxx> writes:
> > we'll consider just three columns, which we unimaginatively call a, b,
> > and c. There are also three indexes:
>
> > t_a_idx   btree (a)    WHERE a IS NOT NULL
> > t_b_idx   btree (b)    WHERE b IS NOT NULL
> > t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL
>
> > Now I have a query
>
> >     select c from t where a='A' and b='B';

Just in case somebody wants to play along at home: a narrow table with
just these three columns and synthetic data shows the problem, but
doesn't use the flipped multicolumn index either. I guess you have to
pad the table a bit to change the costs sufficiently. (A sketch of such
a setup is in the PS below.)

> > This uses t_b_idx, not - as I expected - t_a_b_idx.
>
> > The distribution of values in columns a and b is quite different: a
> > has 10 different values of similar frequency (and no null values). b
> > has only a single non-null value, which occurs with a frequency of
> > about 1 %.
>
> > So I definitely understand why it would prefer t_b_idx to t_a_idx,
> > but surely t_a_b_idx should be even better?
>
> Not necessarily --- t_a_b_idx is (presumably) physically bigger than
> t_b_idx, which makes it more expensive to search. The additional
> selectivity gain apparently doesn't outweigh that.

The index is about 20 % larger, but only 1/10 of it has to be searched,
so a scan should take roughly 1.2 / 10 = 12 % of the time, i.e.
1 - (1.2 / 10) = 88 % less.

> > If I create an index with the columns swapped:
>
> >     t_b_a_idx btree (b, a) WHERE b IS NOT NULL AND a IS NOT NULL
>
> > this index will be used.
>
> Hmm. Probably that has something to do with a calculation about
> the selectivity of the leading index column, ie do you have to
> scan 10% of the index or 1% of the index.

Does it use only the selectivity of the first column? That would
explain it. The total selectivity of both columns together is obviously
the same.

> It's not taking the partial-index filter into account in that, I
> suspect, which skews the results in this case --- but that would be
> hard to account for accurately.

Hmm. Wouldn't that be a problem for partial indexes in general? They
usually cover only a small portion of the table, and if the selectivity
is computed relative to the whole table, the result may be way off.

I think it should be possible to adjust for a "WHERE column IS NOT
NULL" filter, because null_frac is in the statistics (see the pg_stats
query in the PS). For the general case you would need an estimate of
the number of rows covered by the index, which I don't think we have.

> Anyway I can't get excited about optimizing for a single non-null
> value.

That's certainly an extreme case, but I've seen bad plans for similar
queries every now and then. At least now I have an idea what causes it,
and maybe a solution/workaround.

        hp
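PS: A minimal sketch of the synthetic setup I mean, for playing along
at home. The table name, column types, row count and the padding column
are my invention here, not the real schema, so treat it as a starting
point only:

    -- Hypothetical reproduction; "pad" makes the rows less trivially
    -- narrow so that index size differences actually matter.
    CREATE TABLE t (
        a   text,
        b   text,
        c   text,
        pad text
    );

    INSERT INTO t (a, b, c, pad)
    SELECT chr(65 + (i % 10)),                  -- ten values 'A'..'J', uniform
           CASE WHEN i % 100 = 0 THEN 'B' END,  -- ~1 % 'B', otherwise NULL
           'payload ' || i,
           repeat('x', 100)
    FROM generate_series(1, 1000000) AS g(i);

    CREATE INDEX t_a_idx   ON t (a)    WHERE a IS NOT NULL;
    CREATE INDEX t_b_idx   ON t (b)    WHERE b IS NOT NULL;
    CREATE INDEX t_a_b_idx ON t (a, b) WHERE a IS NOT NULL AND b IS NOT NULL;

    ANALYZE t;

    EXPLAIN SELECT c FROM t WHERE a = 'A' AND b = 'B';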
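On such a setup you can also check the "about 20 % larger" claim
against the catalog, since the planner's cost estimate starts from the
physical index sizes:

    -- compare the physical sizes of the three indexes from the sketch
    SELECT relname, pg_size_pretty(pg_relation_size(oid))
    FROM pg_class
    WHERE relname IN ('t_a_idx', 't_b_idx', 't_a_b_idx');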
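And the null_frac I referred to above is visible in the standard
pg_stats view ('t' again being the sketch table):

    -- For a "WHERE col IS NOT NULL" partial index, null_frac is exactly
    -- the fraction of the table the index does not cover.
    SELECT attname, null_frac, n_distinct
    FROM pg_stats
    WHERE schemaname = 'public' AND tablename = 't';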
--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@xxxxxx         | management tools.
__/   | http://www.hjp.at/ |  -- Ross Anderson <https://www.edge.org/>