Re: serious under-estimation of n_distinct for clustered distributions

On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@xxxxxxxxxxx> wrote:
Now, the 2005 discussion goes into great detail on the advantages and
disadvantages of this algorithm, particularly when using small sample sizes,
and several alternatives are discussed. I do not know whether anything has
been changed since then, but I do know that the n_distinct problem, which I
will focus on here, still persists.

It's a really hard problem to solve satisfactorily. It's a problem
that has been studied in much detail. Yes, the algorithm used is still
the same. See the comments within src/backend/commands/analyze.c (IBM
Research Report RJ 10025 is referenced there).

Thanks a lot for this information! I looked through the code a bit. The Haas & Stokes formula itself is fine; the problem really lies with the two-stage random selection procedure:

Starting from line 1039, there is a comment:
 * As of May 2004 we use a new two-stage method: Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks). The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.
 *
 * Although every row has an equal chance of ending up in the final
 * sample, this sampling method is not perfect: not every possible
 * sample has an equal chance of being selected. For large relations
 * the number of different blocks represented by the sample tends to be
 * too small. We can live with that for now. Improvements are welcome.


Now the problem with clustered data is that the probability of sampling a value twice is much higher when the same page is sampled repeatedly. Because stage one takes a random sample of pages and stage two then samples rows only from those pages, the probability of visiting the same page twice (or more often) is much higher than if random rows were selected from the whole table. Hence we get many more repeated values for clustered data, and we end up with the severe under-estimation we can see in those cases.
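To see why the extra duplicates matter so much, recall the estimator quoted in the analyze.c comments:

    n*d / (n - f1 + f1*n/N)

where n is the sample size, N the total number of rows, d the number of distinct values found in the sample, and f1 the number of values seen exactly once. A tiny worked example with made-up numbers: for n = 6 rows sampled from an N = 1,000 row table, six singletons give 6*6 / (6 - 6 + 6*6/1000) = 1,000, whereas three values seen twice each give 6*3 / (6 - 0 + 0*6/1000) = 3. The formula is doing exactly what it was designed to do; the question is only whether the sample it is fed contains a representative share of duplicates.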

Probabilities do my brain in, as usual, but I tested the procedure on my test data with a simple Python script. There is absolutely nothing wrong with the implementation; it seems to be a purely statistical problem.
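Roughly, a scaled-down model of such a test looks like this (the table sizes and the perfectly clustered layout are invented for illustration, and stage two is simplified to a uniform pick from the rows of the chosen blocks, which is what the Vitter reservoir step effectively delivers):

import random
from collections import Counter

ROWS_PER_PAGE = 100
N_PAGES = 20000
N_ROWS = N_PAGES * ROWS_PER_PAGE           # 2,000,000 rows
ROWS_PER_VALUE = 10                        # clustered: each value fills 10 adjacent rows
TRUE_NDISTINCT = N_ROWS // ROWS_PER_VALUE  # 200,000 distinct values
SAMPLE = 6000                              # sample size (and number of blocks in stage one)

def value_at(row):
    return row // ROWS_PER_VALUE           # perfectly clustered layout

def haas_stokes(sample_values, total_rows):
    # the estimator discussed above: n*d / (n - f1 + f1*n/N)
    n = len(sample_values)
    counts = Counter(sample_values)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / total_rows)

def row_level_sample():
    # "ideal" sampling: SAMPLE rows drawn uniformly from the whole table
    return [value_at(r) for r in random.sample(range(N_ROWS), SAMPLE)]

def two_stage_sample():
    # stage one: SAMPLE distinct blocks; stage two: SAMPLE rows drawn
    # uniformly from just those blocks
    pages = random.sample(range(N_PAGES), SAMPLE)
    rows = [p * ROWS_PER_PAGE + i for p in pages for i in range(ROWS_PER_PAGE)]
    return [value_at(r) for r in random.sample(rows, SAMPLE)]

print("true n_distinct    :", TRUE_NDISTINCT)
print("row-level estimate :", round(haas_stokes(row_level_sample(), N_ROWS)))
print("two-stage estimate :", round(haas_stokes(two_stage_sample(), N_ROWS)))

On a layout like this the two-stage estimate should come out far below the row-level one, even though every individual row still had the same chance of ending up in the sample.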

Not everything may be hopeless, though ;-) The problem could theoretically be avoided if random rows were selected from the whole table. Then again, that may not be feasible - the two-stage approach was probably not implemented for nothing.

Another possible solution would be to avoid much of the resampling (though not all of it) in stage two. For that - in theory - every page visited would have to get a lower weight, so that revisiting that page is no more likely than if rows were selected from the whole table. That does not sound easy or elegant to implement, but perhaps there is some clever algorithm - unfortunately I do not know of one.
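Just to make that weighting idea concrete, here is a toy sketch (the names and structure are mine, and it only expresses the statistical idea, nothing that would drop into analyze.c as-is). Each page's weight is the number of its rows that have not been sampled yet, so a page that has already been visited becomes correspondingly less likely to be picked again:

import random

def weighted_page_sample(n_pages, rows_per_page, sample_size, rng=random):
    # remaining (not yet sampled) row offsets per page; a page's weight is
    # simply the number of rows still left on it
    remaining = [list(range(rows_per_page)) for _ in range(n_pages)]
    picked = []
    for _ in range(sample_size):
        weights = [len(rows) for rows in remaining]
        page = rng.choices(range(n_pages), weights=weights)[0]
        row = remaining[page].pop(rng.randrange(len(remaining[page])))
        picked.append((page, row))
    return picked

# e.g. eight rows from a toy table of 10 pages with 5 rows each
print(weighted_page_sample(n_pages=10, rows_per_page=5, sample_size=8))

The catch is that picking a page with probability proportional to its remaining rows and then a uniform row within it is exactly uniform sampling without replacement over the whole table - statistically ideal, but it would touch just as many pages as genuine row-level sampling, which is presumably what the two-stage method is meant to avoid.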


The general advice here is:

1) Increase default_statistics_target for the column.

2) If that doesn't help, consider using the following DDL:

alter table foo alter column bar set ( n_distinct = 5.0);

Yes, I will probably have to live with that for now - I will come back to these workarounds with one or two questions.

Thanks again & Regards,
Stefan



