A status update on this problem:
1.) Workarounds (setting n_distinct manually) are tested and - as far as
workarounds go - OK.
2.) Source of the problem and possible solution:
The source of these troubles is the sampling method employed in
src/backend/commands/analyze.c. Judging from Tom Lane's comment for the
original implementation in 2004 this has never been thought to be
perfect. Does anybody see a chance to improve that part? Should this
discussion be taken elsewhere? Is there any input from my side that
could help?
btw: I do find this problem to be very frequent in our databases. And
considering the commonplace conditions leading to it, I would expect
many systems to be affected. But searching the forums and the web I
hardly found any references to it - which amazes me to no end.
Best Regards,
Stefan
On 12/30/2012 07:02 PM, Stefan Andreatta wrote:
On 12/29/2012 10:57 PM, Peter Geoghegan wrote:
On 29 December 2012 20:57, Stefan Andreatta <s.andreatta@xxxxxxxxxxx>
wrote:
Now, the 2005 discussion goes into great detail on the advantages and
disadvantages of this algorithm, particularly when using small
sample sizes,
and several alternatives are discussed. I do not know whether
anything has
been changed after that, but I know that the very distinct problem,
which I
will focus on here, still persists.
It's a really hard problem to solve satisfactorily. It's a problem
that has been studied in much detail. Yes, the algorithm used is still
the same. See the comments within src/backend/commands/analyze.c (IBM
Research Report RJ 10025 is referenced there).
Thanks a lot for this information! I looked through the code a bit.
The Haas & Stokes Formula is fine. The problem really lies with the
two phase random selection procedure:
Starting from line 1039, there is a comment:
* As of May 2004 we use a new two-stage method: Stage one selects up
* to targrows random blocks (or all blocks, if there aren't so many).
* Stage two scans these blocks and uses the Vitter algorithm to create
* a random sample of targrows rows (or less, if there are less in the
* sample of blocks). The two stages are executed simultaneously: each
* block is processed as soon as stage one returns its number and while
* the rows are read stage two controls which ones are to be inserted
* into the sample.
*
* Although every row has an equal chance of ending up in the final
* sample, this sampling method is not perfect: not every possible
* sample has an equal chance of being selected. For large relations
* the number of different blocks represented by the sample tends to be
* too small. We can live with that for now. Improvements are welcome.
Now the problem with clustered data is, that the probability of
sampling a value twice is much higher when the same page is repeatedly
sampled. As stage one takes a random sample of pages, and stage two
samples rows from these pages, the probability of visiting the same
page twice (or more often) is much higher than if random rows were
selected from the whole table. Hence we get a lot more multiple values
for clustered data and we end up with the severe under-estimation we
can see in those cases.
Probabilities do my brain in, as usual, but I tested the procedure for
my test data with a simple python script. There is absolutely nothing
wrong with the implementation. It seems to be a purely statistical
problem.
Not everything may be hopeless though ;-) The problem could
theoretically be avoided if random rows were selected from the whole
table. Again, that may not be feasible - the two phase approach was
probably not implemented for nothing.
Another possible solution would be to avoid much of the resampling
(not all) in phase two. For that - in theory - every page visited
would have to get a lower weight, so that revisiting this page is not
any more likely as rows were selected from the whole column. That does
not sound easy or elegant to implement. But perhaps there is some
clever algorithm - unfortunately I do not know.
The general advice here is:
1) Increase default_statistics_target for the column.
2) If that doesn't help, consider using the following DDL:
alter table foo alter column bar set ( n_distinct = 5.0);
Yes, I will probably have to live with that for now - I will come back
to these workarounds with one or two questions.
Thanks again & Regards,
Seefan
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance