Re: Strange Behaviour with multicolumn indexes

Tom Lane <tgl@xxxxxxxxxxxxx> · Thu, 12 Sep 2019 12:54:55 -0400

"Peter J. Holzer" <hjp-pgsql@xxxxxx> writes:
> we'll consider just three columns, which we unimaginatively call a, b,
> and c. There are also three indexes:

>     t_a_idx   btree (a) WHERE a IS NOT NULL
>     t_b_idx   btree (b) WHERE b IS NOT NULL
>     t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL

> Nowe I have a query 
>     select c from t where a='A' and b='B';

> This uses t_b_idx, not - as I expected - t_a_b_idx.

> The distribution of values in columns a and b is quite different: a has
> 10 different values of similar frequency (and no null values). b has
> only a single non-null value which with a frequency of about 1 %. 

> So I definitely understand why it would prefer t_b_idx to t_a_idx, but
> certainly t_a_b_idx should be even better?

Not necessarily --- t_a_b_idx is (presumably) physically bigger than
t_b_idx, which makes it more expensive to search.  The additional
selectivity gain apparently doesn't outweigh that.

> If I create an index with the columns swapped:
>     t_b_a_idx btree (b, a) WHERE b IS NOT NULL and a IS NOT NULL
> this index will be used.

Hmm.  Probably that has something to do with a calculation about
the selectivity of the leading index column, ie do you have to
scan 10% of the index or 1% of the index.  It's not taking the
partial-index filter into account in that, I suspect, which skews
the results in this case --- but that would be hard to account
for accurately.  Anyway I can't get excited about optimizing for
a single non-null value.

			regards, tom lane