On 2019-09-12 12:54:55 -0400, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@xxxxxx> writes:
> > we'll consider just three columns, which we unimaginatively call a, b,
> > and c. There are also three indexes:
>
> > t_a_idx   btree (a)    WHERE a IS NOT NULL
> > t_b_idx   btree (b)    WHERE b IS NOT NULL
> > t_a_b_idx btree (a, b) WHERE a IS NOT NULL AND b IS NOT NULL
>
> > Now I have a query
>
> >     select c from t where a='A' and b='B';

Just in case somebody wants to play along at home: a narrow table with
just these three columns and synthetic data shows the problem, but
doesn't use the flipped multicolumn index either. I guess you have to
pad the table a bit to change the costs sufficiently. (A sketch of such
a setup is in the PS below.)

> > This uses t_b_idx, not - as I expected - t_a_b_idx.
>
> > The distribution of values in columns a and b is quite different: a
> > has 10 different values of similar frequency (and no null values). b
> > has only a single non-null value, which occurs with a frequency of
> > about 1 %.
>
> > So I definitely understand why it would prefer t_b_idx to t_a_idx,
> > but surely t_a_b_idx should be even better?
>
> Not necessarily --- t_a_b_idx is (presumably) physically bigger than
> t_b_idx, which makes it more expensive to search. The additional
> selectivity gain apparently doesn't outweigh that.

The index is about 20 % larger, but only 1/10 of it has to be searched,
so a scan should take roughly 1.2 / 10 = 12 % of the time, i.e.
1 - (1.2 / 10) = 88 % less.

> > If I create an index with the columns swapped:
>
> >     t_b_a_idx btree (b, a) WHERE b IS NOT NULL AND a IS NOT NULL
>
> > this index will be used.
>
> Hmm. Probably that has something to do with a calculation about
> the selectivity of the leading index column, ie do you have to
> scan 10% of the index or 1% of the index.

Does it use only the selectivity of the first column? That would
explain it. The total selectivity of both columns together is obviously
the same.

> It's not taking the partial-index filter into account in that, I
> suspect, which skews the results in this case --- but that would be
> hard to account for accurately.

Hmm. Wouldn't that be a problem for partial indexes in general? They
usually cover only a small portion of the table, and if the selectivity
is computed relative to the whole table, the result may be way off.

I think it should be possible to adjust for a "WHERE column IS NOT
NULL" filter, because null_frac is in the statistics (see the pg_stats
query in the PS). For the general case you would need an estimate of
the number of rows covered by the index, which I don't think we have.

> Anyway I can't get excited about optimizing for a single non-null
> value.

That's certainly an extreme case, but I've seen bad plans for similar
queries every now and then. At least now I have an idea what causes it,
and maybe a solution/workaround.

        hp
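PS: A minimal sketch of the synthetic setup I mean, for playing along
at home. The table name, column types, row count and the padding column
are my invention here, not the real schema, so treat it as a starting
point only:

    -- Hypothetical reproduction; "pad" makes the rows less trivially
    -- narrow so that index size differences actually matter.
    CREATE TABLE t (
        a   text,
        b   text,
        c   text,
        pad text
    );

    INSERT INTO t (a, b, c, pad)
    SELECT chr(65 + (i % 10)),                  -- ten values 'A'..'J', uniform
           CASE WHEN i % 100 = 0 THEN 'B' END,  -- ~1 % 'B', otherwise NULL
           'payload ' || i,
           repeat('x', 100)
    FROM generate_series(1, 1000000) AS g(i);

    CREATE INDEX t_a_idx   ON t (a)    WHERE a IS NOT NULL;
    CREATE INDEX t_b_idx   ON t (b)    WHERE b IS NOT NULL;
    CREATE INDEX t_a_b_idx ON t (a, b) WHERE a IS NOT NULL AND b IS NOT NULL;

    ANALYZE t;

    EXPLAIN SELECT c FROM t WHERE a = 'A' AND b = 'B';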
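On such a setup you can also check the "about 20 % larger" claim
against the catalog, since the planner's cost estimate starts from the
physical index sizes:

    -- compare the physical sizes of the three indexes from the sketch
    SELECT relname, pg_size_pretty(pg_relation_size(oid))
    FROM pg_class
    WHERE relname IN ('t_a_idx', 't_b_idx', 't_a_b_idx');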
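And the null_frac I referred to above is visible in the standard
pg_stats view ('t' again being the sketch table):

    -- For a "WHERE col IS NOT NULL" partial index, null_frac is exactly
    -- the fraction of the table the index does not cover.
    SELECT attname, null_frac, n_distinct
    FROM pg_stats
    WHERE schemaname = 'public' AND tablename = 't';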
--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@xxxxxx         | management tools.
__/   | http://www.hjp.at/ |  -- Ross Anderson <https://www.edge.org/>