Strange Behaviour with multicolumn indexes

"Peter J. Holzer" <hjp-pgsql@xxxxxx> · Thu, 12 Sep 2019 18:41:05 +0200

[PostgreSQL 11.5 (Ubuntu 11.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu]

I have a table with many columns and many indexes (actually many tables
with many columns and many indexes), but for the sake of this posting,
we'll consider just three columns, which we unimaginatively call a, b,
and c. There are also three indexes:

    t_a_idx   btree (a) WHERE a IS NOT NULL
    t_b_idx   btree (b) WHERE b IS NOT NULL
    t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL

Nowe I have a query 
    select c from t where a='A' and b='B';

This uses t_b_idx, not - as I expected - t_a_b_idx.

If I create an index with the columns swapped:

    t_b_a_idx btree (b, a) WHERE b IS NOT NULL and a IS NOT NULL

this index will be used.

The distribution of values in columns a and b is quite different: a has
10 different values of similar frequency (and no null values). b has
only a single non-null value which with a frequency of about 1 %. 

So I definitely understand why it would prefer t_b_idx to t_a_idx, but
certainly t_a_b_idx should be even better? After all it would have to
read only 1/1000 of the rows instead of 1/100. it would also have to
scan much less of the index, so the fact the fact that the index is a
bit larger shouldn't make a difference.

Explain shows that the row estimates are spot on, but the cost for using
t_a_b_idx is higher than for t_b_idx (which is in turn higher than for
t_b_a_idx).

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@xxxxxx         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
Attachment:
signature.asc

Description: PGP signature