most_common_vals had 22 entries before the ANALYZE; after ANALYZE it has 23.
After ANALYZE there is an entry for "92" with a frequency of 0.0441333 (which is about right).
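For reference, this is roughly how I'm inspecting the column stats (the table name here is a placeholder for the real one):

    -- Look at the per-column statistics for "Ver" (table name assumed).
    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'big_table' AND attname = 'Ver';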
The stats target for the "Ver" column is already at 10000. I'm going to have to bring the stats target back down on everything, but I'm not sure about this. The life cycle of this table is a bit special: once in a while a new "Version" is created and 1 to 3 million records are inserted at once, all with the same Version and with sequential Ids (restarting from 1 with each version). The unfortunate side effect is that I get huge clusters of records with the same "Ver".

I created a script that calculates the correct "n_distinct" value for the column and repeatedly runs ANALYZE until the reported "n_distinct" value is greater than 75% of the correct number; on each loop of the script the stats target is increased by 5%. I thought this would help me find a good value for the stats target, but it invariably pushes the stats target all the way up to 10000.
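A sketch of that loop as a PL/pgSQL DO block (the table/column names and the starting target are assumptions, and a negative n_distinct, meaning a ratio of the row count, is not handled here):

    DO $$
    DECLARE
      actual   bigint;
      reported real;
      target   int := 100;  -- assumed starting point
    BEGIN
      -- The "correct" n_distinct, computed the hard way.
      SELECT count(DISTINCT "Ver") INTO actual FROM big_table;
      LOOP
        EXECUTE format(
          'ALTER TABLE big_table ALTER COLUMN "Ver" SET STATISTICS %s',
          target);
        EXECUTE 'ANALYZE big_table ("Ver")';
        SELECT n_distinct INTO reported
          FROM pg_stats
         WHERE tablename = 'big_table' AND attname = 'Ver';
        -- Stop once the estimate is within 75% of reality, or at the cap.
        EXIT WHEN reported >= 0.75 * actual OR target >= 10000;
        target := least(ceil(target * 1.05)::int, 10000);  -- +5% per loop
      END LOOP;
      RAISE NOTICE 'final stats target: %, reported n_distinct: %',
        target, reported;
    END $$;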
Finally I have one last option: take "stats" into my own hands. Since inserting anything into those tables is such a big (but rare and well-defined) event, I could simply set the stats target to ZERO and compute correct values on my own after pushing a new version. The issue here is that I don't understand the system well enough to make this work.
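If I go down that road, a less drastic variant might be the documented n_distinct attribute option, which tells ANALYZE to use my value instead of its own estimate (names assumed; note that with the stats target at zero ANALYZE skips the column entirely, so this only helps while the column is still being analyzed):

    -- Pin the distinct-value count so ANALYZE stops under-estimating it.
    ALTER TABLE big_table ALTER COLUMN "Ver" SET (n_distinct = 25);
    ANALYZE big_table ("Ver");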
Hopefully I'll be able to reproduce this on a backup of the database so I can safely experiment. Until I manage to reproduce it, I don't think I can make any more progress, so thank you everyone for the help.
On Thu, 16 Jan 2020 at 20:55, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
Cosmin Prund <cprund@xxxxxxxxx> writes:
> Running the same query with a different "Ver" produces a proper plan.
Oh, that *is* interesting.
After studying the code a bit more I see why this is possible when I
originally thought not. The case that you are interested in is one that
has special handling -- it's a "lower-order ScalarArrayOpExpr" in the
terms of the code. This means that get_index_paths will actually produce
two index paths, one with the IN clause as an indexqual and one without,
because it expects that they have different sort behaviors [1]. So then
we do have a chance for a cost-based choice, making it possible for the
estimated selectivity of the higher-order clause to affect the outcome.
I'm still a bit surprised that it wouldn't choose the alternative with
the IN ... but if the estimated number of rows matching just the first
column is small enough, it might see the paths as having indistinguishable
costs, and then it's down to luck which it chooses.
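Schematically (with placeholder names), the two paths look like:

    -- Path 1: the IN clause used as an indexqual on (Ver, Id):
    --   Index Scan using tbl_ver_id_idx on tbl
    --     Index Cond: ((Ver = 92) AND (Id = ANY ('{...}'::integer[])))
    --
    -- Path 2: only the leading column as an indexqual, IN as a filter:
    --   Index Scan using tbl_ver_id_idx on tbl
    --     Index Cond: (Ver = 92)
    --     Filter: (Id = ANY ('{...}'::integer[]))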
> There are 25 valid values for "Ver" in this database. I ran the query for
> all of them. The only one misbehaving is "92". I ran the query with
> random (invalid) values for Ver; the query plan always attempts to
> use the index with both values.
> I looked into "most_common_vals" in pg_stats; this value (92) is not in
> that list.
Are the other 24 all in the list?
> Finally I ran "ANALYZE" again and now the problem went away. Running the
> query with Ver=92 uses the proper plan. I'm not happy with this - I know I
> haven't solved the problem (I've run ANALYZE multiple times before).
Maybe increasing the stats target for the "Ver" column would help. It
sounds like you want to get to a point where all the valid values are
given in the MCV list, so that the estimates for them will be accurate.
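In concrete terms that's just (table name and target value assumed):

    ALTER TABLE tbl ALTER COLUMN "Ver" SET STATISTICS 1000;
    ANALYZE tbl ("Ver");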
regards, tom lane
[1] Right at the moment, it seems like that's wrong and we could just
generate one path. Need to study this.