On 3.10.2014 21:58, Jeff Janes wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
>
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one. While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> In my hands, the problems with poor n_distinct were not due to the
> insufficient size of the sample, but the insufficient randomness of it.
> Increasing default_statistics_target did help, but not because it
> increases the number of rows sampled, but rather because it increases
> the number of blocks sampled. Once substantially all of the blocks are
> part of the block sampling, the bias is eliminated even though it was
> still sampling a small fraction of the rows (roughly one per block).

I don't think that's entirely accurate. According to [1], there's a lower
bound on the ratio error, depending on the number of sampled rows. Say
there's a table with 10M rows and we sample 30k rows (the default). Then
with 5% probability we'll get a ratio error over 20, i.e. we may estimate
either <5% or >2000% of the actual ndistinct value. Combined with our
arbitrary 10% limit that we use to decide whether ndistinct scales with
the number of rows, this sometimes explodes.

By increasing the statistics target you get a much larger sample and thus
a lower probability of such an error. Nevertheless, it still breaks from
time to time, and the fact that the statistics target is static (rather
than scaling with the size of the table to get an appropriate sample
size) is not really helping IMHO. A static sample size may work for
histograms; for ndistinct, not so much.

[1] http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

> So one idea would be to get rid of the 2-stage sampling algorithm
> (sample blocks, sample rows from the chosen blocks) and just read the
> whole table and sample rows from it unbiased, at least under some
> conditions. Some low-level benchmarking on my favorite server showed
> that reading 1% of a system's blocks (in block number order within each
> file) was no faster than reading all of them from an IO perspective.
> But that is a virtualized server that wasn't really spec'd out to be an
> IO-intensive database server in the first place. It would be
> interesting to see what people get on real hardware that they actually
> designed for the task.

I think there was a discussion about the sampling on pgsql-hackers a
while ago ... and yes, here it is [2]. However, it seems there was no
clear conclusion on how to change it at that time ...

[2] http://www.postgresql.org/message-id/CA+TgmoZaqyGSuaL2v+YFVsX06DQDQh-pEV0nobGPws-dNwAwBw@xxxxxxxxxxxxxx

regards
Tomas
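
To put numbers to the bound discussed above, here is a minimal Python
sketch (illustrative only, not PostgreSQL code). It assumes my reading of
the lower bound in [1] -- with probability at least delta, any estimator
working from a sample of r rows out of n has a ratio error of at least
sqrt((n - r) / (2r) * ln(1 / delta)) -- and assumes ANALYZE's default of
sampling 300 rows per unit of statistics target, so target 100 gives the
30k-row sample from the example:

    # Sketch of the ratio-error lower bound from [1] (not PostgreSQL code):
    # with probability >= delta, any estimator working from a sample of r rows
    # out of n has ratio error >= sqrt((n - r) / (2r) * ln(1 / delta)).
    from math import log, sqrt

    def ratio_error_bound(n_rows, sample_rows, delta=0.05):
        return sqrt((n_rows - sample_rows) / (2.0 * sample_rows) * log(1.0 / delta))

    n = 10 * 1000 * 1000                 # the 10M-row table from the example
    for target in (100, 1000, 10000):    # ANALYZE samples 300 * statistics_target rows
        r = 300 * target
        print("target %5d  sample %7d  ratio error >= %.1f"
              % (target, r, ratio_error_bound(n, r)))

    # target   100  sample   30000  ratio error >= 22.3
    # target  1000  sample  300000  ratio error >= 7.0
    # target 10000  sample 3000000  ratio error >= 1.9

So raising the target does help, but the bound only shrinks with the
square root of the sample size, which is why a fixed sample can't make
the problem go away entirely.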
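
And to show how such an error interacts with the 10% limit mentioned
above, here is a simplified sketch of that heuristic (not the actual
ANALYZE code; the function name is mine): an estimate above 10% of the
row count is stored as a negative fraction and scales with the table,
anything below it is stored as an absolute count that stays fixed as the
table grows.

    # Simplified sketch of the "10% limit" heuristic described above (not the
    # actual ANALYZE code; the name store_ndistinct is mine).
    def store_ndistinct(estimated_distinct, total_rows):
        if estimated_distinct > 0.10 * total_rows:
            # assumed to scale with the table: stored as a negative fraction
            return -(estimated_distinct / float(total_rows))
        # otherwise treated as a constant, even as the table keeps growing
        return float(estimated_distinct)

    # A column with 1.5M distinct values in a 10M-row table should scale (-0.15),
    # but a 20x underestimate (75k) falls below the 10% cutoff and gets stored
    # as a fixed count -- the kind of flip that makes the estimates explode.
    print(store_ndistinct(1500000, 10000000))   # -0.15
    print(store_ndistinct(  75000, 10000000))   # 75000.0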