Re: n_distinct off by a factor of 1000

Michael Lewis <mlewis@xxxxxxxxxxx> · Tue, 23 Jun 2020 11:33:26 -0600

> > On 23/06/2020 14:42, Klaudie Willis wrote:

> >

> > > I got my first hint of why this problem occurs when I looked at the

> > > statistics.  For the column in question, "instrument_ref" the

> > > statistics claimed it to be:

> > > The default_statistics_target=500, and analyze has been run.

> > > select * from pg_stats where attname like 'instr%_ref'; -- Result:

> > > 40.000

> > > select count(distinct instrumentid_ref) from bigtable -- Result: 33

> > > 385 922 (!!)That is an astonishing difference of almost a 1000X.

Try something like this to check how representative those "most common values" are. If you have n_distinct very low compared to reality and also the fraction of the table that the "most common" values are claiming to cover is low, then you can get very bad estimates when querying for values that are not in the MCVs list. The planner will assume an even distribution for other values and that may be much much higher or lower than reality. That is, if you have statistics target of 100 like normal, and those cover 5% of the table, and you have ndistinct value of 500, then the other 400 values are assumed to evenly cover that 95% of the table so each value would be .95/400 * reltuples as an estimate. If your real count of distinct values is 40000 then the number of values you expect to get for each value in your IN clause drops hugely.

Using a custom ndistinct will dramatically impact the estimates that the planner is using to make the decision of index vs sequential scan. Also, if the custom ndistinct and the actual distinct count vary by 2x or 10x as your data grows, it matters very little IMO as compared to relying on the sample taken by (auto)analyze job being off by a factor of 1000x or even 100x as you have experienced.

SELECT
	( SELECT SUM (x) FROM UNNEST (most_common_freqs) x ) frac_MCV,
	tablename,
	attname,
	inherited,
	null_frac,
	n_distinct,
	array_length(most_common_vals,1) n_mcv,
	array_length(histogram_bounds,1) n_hist,
	correlation,
	*
FROM pg_stats
WHERE
	schemaname = 'public'
	AND tablename=‘table’
	AND attname=‘column’;