On Thu, Mar 26, 2015 at 5:44 AM, Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> wrote:
On 26.3.2015 08:48, Jeff Janes wrote:
>
> OK, this is starting to look like a long-standing bug to me.
>
> If it only sees 3 distinct values, and all three are present at least
> twice, it throws all of them into the MCV list. But if one of those 3
> were present just once, then it tests them to see if they qualify.
> The test for inclusion is that it has to be present more than once,
> and that it must be "over-represented" by 25%.
>
> Lets say it sampled 30000 rows and found 29,900 of one value, 99 of
> another, and 1 of a third.
>
> But that turns into the second one needing to be present 12,500 times.
> The average value is present 10,000 times (30,000 samples with 3
> distinct values) and 25 more than that is 12,500. So it excluded.
>
> It seems to me that a more reasonable criteria is that it must be
> over-represented 25% compared to the average of all the remaining values
> not yet accepted into the MCV list. I.e. all the greater ones should be
> subtracted out before computing the over-representation threshold.
That might work IMO, but maybe we should increase the coefficient a bit
(say, from 1.25 to 2), not to produce needlessly long MCV lists.
That wouldn't work here, because at the point of decision the value present 99 times contributes half the average, so the average is 50, and of course it can't possibly be twice of that.
I have a patch, but is there a way to determine how it affects a wide variety of situations? I guess run `make installcheck`, then analyze, then dump pg_stats, with the patch and without the patch, and then compare the dumpsj?
> It is also grossly inconsistent with the other behavior. If they are
> "29900; 98; 2" then all three go into the MCV.
Isn't the mincount still 12500? How could all three get into the MCV?
If all observed values are observed at least twice, it takes a different path through the code. It just keeps them all in the MCV list. That is what is causing the instability for the OP. If the 3rd most common is seen twice, then all three are kept. If it is seen once, then only the most common is kept. See if statements at 2494 and 2585
else if (toowide_cnt == 0 && nmultiple == ndistinct)
if (track_cnt == ndistinct ....
Cheers,
Jeff
Attachment:
analyze_highly_skewed.patch
Description: Binary data
-- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance