qnex42@xxxxxxxxx (Dawid Kuroczko) writes: > ALTER TABLE foo ALTER COLUMN bar SET STATISTICS n; ..... > > I wonder what are the implications of using this statement, > I know by using, say n=100, ANALYZE will take more time, > pg_statistics will be bigger, planner will take longer time, > on the other hand it will make better decisions... Etc, etc. > > I wonder however when it is most uselful to bump it up. > Please tell me what you think about it: > > Is bumping up statistics is only useful for indexed columns? The main decision changes that result from this would occur then... > When is it most useful/benefitial to bump them up: > > 1) huge table with huge number of distinct values (_almost_ > unique ;)) > > 2) huge table with relatively equally distributed values > (like each value is in between, say, 30-50 rows). > > 3) huge table with unequally distributed values (some > values are in 1-5 rows, some are in 1000-5000 rows). > > 4) huge table with small number values (around ~100 > distinct values, equally or uneqally distributed). A hard and fast rule hasn't emerged, definitely not to distinguish precisely between these cases. There are two effects that come out of changing the numbers: 1. They increase the number of tuples examined. This would pointedly affect cases 3 and 4, increasing the likelihood that the statistics are more representative 2. They increase the number of samples that are kept, increasing the number of items recorded in the histogram. If you have on the order of 100 unique values (it would not be unusual for a company to have 100 "main" customers or suppliers), that allows there to be nearly a bin apiece, which makes estimates _way_ more representative both for common and less common cases amongst the "top 100." Both of those properties are useful for pretty much all of the above cases. > 5) boolean column. Boolean column would more or less indicate SET STATISTICS 2; the only point to having more would be if there was one of the values that almost never occurred so that you'd need to collect more stats to even pick up instances of the "rare" case. A boolean column is seldom much use for indices anyways... > I think SET STATISTICS 100 is very useful for case with unequally > distributed values, but I wonder what about the other cases. And as > a side note -- what are the reasonable bounds for statistics > (between 10 and 100?) If there are, say, 200 unique values, then increasing from 10 to 100 would seem likely to be useful in making the histogram MUCH more representative... > What are the runtime implications of setting statistics too large -- > how much can it affect queries? More stats would mean a bit more time evaluating query plans, but the quality of the plans should be better. > And finally -- how other RDBMS and RDBM-likes deal with this issue? > :) For Oracle and DB/2, the issues are not dissimilar. Oracle somewhat prefers the notion of collecting comprehensive statistics on the whole table, which will be even more costly than PostgreSQL's sampling. -- let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];; http://www.ntlug.org/~cbbrowne/linuxxian.html A VAX is virtually a computer, but not quite.