Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a): > On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote: >> Yes, it's only intractable if you're wedded to the idea of a tiny, >> fixed-size sample. If we're allowed to sample, say, 1% of the table, we >> can get a MUCH more accurate n_distinct estimate using multiple >> algorithms, of which HLL is one. While n_distinct will still have some >> variance, it'll be over a much smaller range. > > I've gone looking for papers on this topic but from what I read this > isn't so. To get any noticeable improvement you need to read 10-50% of > the table and that's effectively the same as reading the entire table > -- and it still had pretty poor results. All the research I could find > went into how to analyze the whole table while using a reasonable > amount of scratch space and how to do it incrementally. I think it's really difficult to discuss the estimation without some basic agreement on what are the goals. Naturally, we can't get a perfect estimator with small samples (especially when the sample size is fixed and not scaling with the table). But maybe we can improve the estimates without scanning most of the table? FWIW I've been playing with the adaptive estimator described in [1] and the results looks really interesting, IMHO. So far I was testing it on synthetic datasets outside the database, but I plan to use it instead of our estimator, and do some more tests. Would be helpful to get a collection of test cases that currently perform poorly. I have collected a few from the archives, but if those who follow this thread can provide additional test cases / point to a thread describing related etc. that'd be great. It certainly won't be perfect, but if it considerably improves the estimates then I believe it's step forward. Ultimately, it's impossible to improve the estimates without increasing the sample size. [1] http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf regards Tomas -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance