On 3.10.2014 02:54, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the
>> table, we can get a MUCH more accurate n_distinct estimate using
>> multiple algorithms, of which HLL is one. While n_distinct will
>> still have some variance, it'll be over a much smaller range.
>
> I think that HyperLogLog, as a streaming algorithm, will always
> require that the entire set be streamed. This doesn't need to be a
> big deal - in the case of my abbreviated key patch, it appears to
> basically be free because of the fact that we were streaming
> everything anyway. It's a very cool algorithm, with fixed overhead
> and constant memory usage. It makes very useful guarantees around
> accuracy.

I think you're mixing two things here - estimating the number of
distinct values in a sample (which can be done very efficiently using
HLL) and estimating the number of distinct values in the whole table.
For the latter, HLL is not usable unless you process all the data.

Sadly, HLL is rather incompatible with the usual estimators, because
the ones I'm aware of need to know the number of occurrences of the
distinct values, etc.

But couldn't we just piggyback this on autovacuum? One of the nice HLL
features is that it's additive - you can build "partial counters" for
ranges of blocks (say, a few MB per range) and then merge them when
needed. By keeping the per-range parts, only the counters for block
ranges that actually changed need to be rebuilt.

But maybe this idea is way too crazy ...

regards
Tomas
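PS: To make the "additive" part concrete, here's a minimal stand-alone
sketch of the idea - one HLL counter per block range, merged by taking
the register-wise maximum. To be clear, this is not the actual
PostgreSQL code; the counter size, the hash function and the omission
of the small/large-range bias corrections are all just my choices for
the example.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    #define HLL_BITS      10                  /* b: 2^10 = 1024 registers */
    #define HLL_REGISTERS (1 << HLL_BITS)

    typedef struct HLLCounter
    {
        uint8_t     reg[HLL_REGISTERS];       /* one "partial counter" */
    } HLLCounter;

    /* 64-bit finalizer from MurmurHash3 - any decent hash works here */
    static uint64_t
    hash64(uint64_t x)
    {
        x ^= x >> 33;
        x *= UINT64_C(0xff51afd7ed558ccd);
        x ^= x >> 33;
        x *= UINT64_C(0xc4ceb9fe1a85ec53);
        x ^= x >> 33;
        return x;
    }

    /* add one value: top b bits pick a register, the rest give the rank */
    static void
    hll_add(HLLCounter *c, uint64_t value)
    {
        uint64_t    h = hash64(value);
        int         idx = (int) (h >> (64 - HLL_BITS));
        uint64_t    rest = h << HLL_BITS;
        uint8_t     rank = 1;

        while (rank < 64 - HLL_BITS && !(rest & (UINT64_C(1) << 63)))
        {
            rest <<= 1;
            rank++;
        }
        if (rank > c->reg[idx])
            c->reg[idx] = rank;
    }

    /* the additive part: merging = register-wise max, order-independent */
    static void
    hll_merge(HLLCounter *dst, const HLLCounter *src)
    {
        for (int i = 0; i < HLL_REGISTERS; i++)
            if (src->reg[i] > dst->reg[i])
                dst->reg[i] = src->reg[i];
    }

    /* raw HLL estimate (bias corrections left out to keep this short) */
    static double
    hll_estimate(const HLLCounter *c)
    {
        double      alpha = 0.7213 / (1.0 + 1.079 / HLL_REGISTERS);
        double      sum = 0.0;

        for (int i = 0; i < HLL_REGISTERS; i++)
            sum += pow(2.0, -(double) c->reg[i]);

        return alpha * HLL_REGISTERS * HLL_REGISTERS / sum;
    }

    int
    main(void)
    {
        HLLCounter  range1, range2, merged;

        memset(&range1, 0, sizeof(range1));
        memset(&range2, 0, sizeof(range2));

        /* two "block ranges" with overlapping sets of values */
        for (uint64_t v = 0; v < 100000; v++)
            hll_add(&range1, v);
        for (uint64_t v = 50000; v < 150000; v++)
            hll_add(&range2, v);

        merged = range1;
        hll_merge(&merged, &range2);

        printf("range1 ~ %.0f, range2 ~ %.0f, merged ~ %.0f (true 150000)\n",
               hll_estimate(&range1), hll_estimate(&range2),
               hll_estimate(&merged));
        return 0;
    }

The merge is just a max() over fixed-size arrays, which is why keeping
one such counter per block range and combining them lazily (e.g. from
autovacuum) seems cheap - duplicates across ranges are handled for free.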