Re: serious under-estimation of n_distinct for clustered distributions

A status update on this problem:

1.) Workarounds (setting n_distinct manually) are tested and - as far as workarounds go - OK.

2.) Source of the problem and possible solution:

The source of these troubles is the sampling method employed in src/backend/commands/analyze.c. Judging from Tom Lane's comment on the original implementation from 2004, it was never meant to be perfect. Does anybody see a chance to improve that part? Should this discussion be taken elsewhere? Is there any input from my side that could help?


By the way: I find this problem to be very frequent in our databases, and considering the commonplace conditions leading to it, I would expect many systems to be affected. Yet searching the forums and the web I have hardly found any references to it - which amazes me to no end.


Best Regards,
Stefan



On 12/30/2012 07:02 PM, Stefan Andreatta wrote:
On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@xxxxxxxxxxx> wrote:
Now, the 2005 discussion goes into great detail on the advantages and disadvantages of this algorithm, particularly when using small sample sizes, and several alternatives are discussed. I do not know whether anything has been changed since then, but I know that the problem I will focus on here - the under-estimation of n_distinct - still persists.

It's a really hard problem to solve satisfactorily. It's a problem
that has been studied in much detail. Yes, the algorithm used is still
the same. See the comments within src/backend/commands/analyze.c (IBM
Research Report RJ 10025 is referenced there).

Thanks a lot for this information! I looked through the code a bit. The Haas & Stokes formula is fine. The problem really lies with the two-stage random selection procedure.
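
For reference, the estimator that is applied to the sample is - as far as I can tell from analyze.c and its comments - the Duj1 variant from the Haas & Stokes report: n_distinct = n*d / (n - f1 + f1*n/N), where n is the sample size, N the total row count, d the number of distinct values in the sample and f1 the number of values seen exactly once. A minimal Python sketch (the function name and the little example are mine, not PostgreSQL's):

    def duj1_ndistinct(n, N, value_counts):
        """Haas & Stokes Duj1 estimator, as I read it in analyze.c:
        n_distinct = n*d / (n - f1 + f1*n/N).
        value_counts maps each value to its number of occurrences in the sample.
        """
        d = len(value_counts)
        f1 = sum(1 for c in value_counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / N)

    # e.g. a 5-row sample {1: 2, 2: 1, 3: 1, 4: 1} from a 1000-row table:
    # duj1_ndistinct(5, 1000, {1: 2, 2: 1, 3: 1, 4: 1})  ->  about 9.9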

Starting from line 1039, there is a comment:
 * As of May 2004 we use a new two-stage method: Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks). The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected. For large relations
 * the number of different blocks represented by the sample tends to be
 * too small. We can live with that for now. Improvements are welcome.


Now the problem with clustered data is that the probability of sampling a value twice is much higher when the same page is repeatedly sampled. As stage one takes a random sample of pages, and stage two samples rows from these pages, the probability of visiting the same page twice (or more often) is much higher than if random rows were selected from the whole table. Hence we get a lot more repeated values for clustered data, and we end up with the severe under-estimation we see in those cases.

Probabilities do my brain in, as usual, but I tested the procedure for my test data with a simple Python script. There is absolutely nothing wrong with the implementation. It seems to be a purely statistical problem.
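
Roughly, such a check can look like the sketch below. This is not my original script; the layout - 100 rows per block, each value filling 20 consecutive rows, 10 million rows, a 30,000-row sample - is made up for illustration, and the block stage is only a crude stand-in for the real Vitter-based code:

    import random
    from collections import Counter

    TOTAL_ROWS     = 10_000_000   # made-up table: 100,000 blocks of 100 rows
    ROWS_PER_BLOCK = 100
    ROWS_PER_VALUE = 20           # clustered: each value fills 20 consecutive rows
    TARGET_ROWS    = 30_000       # sample size for statistics target 100

    def value_at(row):
        return row // ROWS_PER_VALUE

    def row_sample():
        # uniform sample of rows from the whole table
        return random.sample(range(TOTAL_ROWS), TARGET_ROWS)

    def block_sample():
        # crude stand-in for the two-stage method: TARGET_ROWS random blocks,
        # then a uniform sample of TARGET_ROWS rows out of those blocks
        n_blocks = TOTAL_ROWS // ROWS_PER_BLOCK
        blocks = random.sample(range(n_blocks), min(TARGET_ROWS, n_blocks))
        picks = random.sample(range(len(blocks) * ROWS_PER_BLOCK), TARGET_ROWS)
        return [blocks[i // ROWS_PER_BLOCK] * ROWS_PER_BLOCK + i % ROWS_PER_BLOCK
                for i in picks]

    def estimate(sample):
        # Duj1, as in the sketch further up
        counts = Counter(value_at(r) for r in sample)
        d = len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return TARGET_ROWS * d / (TARGET_ROWS - f1 + f1 * TARGET_ROWS / TOTAL_ROWS)

    print("true n_distinct     :", TOTAL_ROWS // ROWS_PER_VALUE)
    print("row-based estimate  :", round(estimate(row_sample())))
    print("block-based estimate:", round(estimate(block_sample())))

With these made-up numbers the row-based estimate should come out close to the true 500,000, while the block-based one should land several times too low - the same pattern of under-estimation described above.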

Not everything may be hopeless though ;-) The problem could theoretically be avoided if random rows were selected from the whole table. Again, that may not be feasible - the two-stage approach was probably not implemented for nothing.

Another possible solution would be to avoid much of the resampling (though not all) in stage two. For that - in theory - every page visited would have to get a lower weight, so that revisiting this page is no more likely than if rows were selected from the whole table. That does not sound easy or elegant to implement. But perhaps there is some clever algorithm - unfortunately I do not know one.
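
Just to make that weighting idea a bit more concrete - purely as an illustration of the statistical property, invented for this mail, not as a proposal for analyze.c: if every block's weight is the number of its rows that have not been sampled yet, and that weight drops by one each time a row is taken from the block, then "pick a block by weight, then pick one of its remaining rows" gives every remaining row of the table the same chance, i.e. it behaves exactly like row sampling from the whole table:

    import random

    def weighted_block_row_sample(rows_per_block, n_rows):
        # rows_per_block: list with the row count of each block.
        # Naive and O(number of blocks) per draw - only meant to show the
        # statistical property, not to be efficient.
        remaining = dict(enumerate(rows_per_block))  # block -> rows not yet sampled
        taken = {}                                   # block -> offsets already taken
        sample = []
        for _ in range(n_rows):
            blocks = list(remaining)
            weights = [remaining[b] for b in blocks]
            b = random.choices(blocks, weights=weights)[0]
            free = [o for o in range(rows_per_block[b])
                    if o not in taken.setdefault(b, set())]
            o = random.choice(free)
            taken[b].add(o)
            remaining[b] -= 1
            if remaining[b] == 0:
                del remaining[b]
            sample.append((b, o))
        return sample

    # e.g. 300 rows out of 1,000 blocks with 100 rows each:
    sample = weighted_block_row_sample([100] * 1000, 300)

The catch, of course, is that such a sample is again spread over about as many different blocks as plain row sampling, so the I/O saving that motivated the block stage would presumably be lost.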


The general advice here is:

1) Increase default_statistics_target for the column.

2) If that doesn't help, consider using the following DDL:

alter table foo alter column bar set ( n_distinct = 5.0);

Yes, I will probably have to live with that for now - I will come back to these workarounds with one or two questions.

Thanks again & Regards,
Stefan





--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

