On Wed, Mar 25, 2015 at 1:00 PM, Feike Steenbergen <feikesteenbergen@xxxxxxxxx> wrote:
On 25 March 2015 at 19:07, Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
> Also, I doubt that that is the problem in the first place. If you collect a
> sample of 30,000 (which the default target size of 100 does), and the
> frequency of the second most common is really 0.00307333 at the time you
> sampled it, you would expect to find it 92 times in the sample. The chances
> against actually finding 1 instead of around 92 due to sampling error are
> astronomical.
It can be that the distribution of values is very volatile; we hope
the increased stats target (from the default=100 to 1000 for this
column) and frequent autovacuum and autoanalyze helps in keeping the
estimates correct.
It seems that it did find some other records (<> 'PRINTED), as is
demonstrated in the stats where there was only one value in the MCV
list: the frequency was 0.996567 and the fraction of nulls was 0,
therefore leaving 0.03+ for other values. But because none of them
were in the MCV and MCF list, they were all treated as equals. They
are certainly not equal.
I not know why some values were found (they are mentioned in the
histogram_bounds), but are not part of the MCV list, as you say, the
likeliness of only 1 item being found is very small.
Does anyone know the criteria for a value to be included in the MCV list?
OK, this is starting to look like a long-standing bug to me.
If it only sees 3 distinct values, and all three are present at least twice, it throws
all of them into the MCV list. But if one of those 3 were present just once, then it
tests them to see if they qualify. The test for inclusion is that it has to be present more than once, and that it must be "over-represented" by 25%.
Lets say it sampled 30000 rows and found 29,900 of one value, 99 of another, and 1 of a third.
But that turns into the second one needing to be present 12,500 times. The average value is present 10,000 times (30,000 samples with 3 distinct values) and 25 more than that is 12,500. So it excluded.
It seems to me that a more reasonable criteria is that it must be over-represented 25% compared to the average of all the remaining values not yet accepted into the MCV list. I.e. all the greater ones should be subtracted out before computing the over-representation threshold.
It is also grossly inconsistent with the other behavior. If they are "29900; 98; 2" then all three go into the MCV.
If they are "29900; 99; 1" then only the highest one goes in. The second one gets evicted for being slightly *more* popular.
This is around line 2605 of src/backend/commands/analyze.c in head.
Cheers,
Jeff