On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.

I think that HyperLogLog, as a streaming algorithm, will always require
that the entire set be streamed. That doesn't need to be a big deal: in
the case of my abbreviated key patch, it appears to be basically free,
because we were streaming everything anyway. It's a very cool algorithm,
with fixed overhead and constant memory usage, and it makes very useful
guarantees about accuracy (see the toy sketch appended below).

I have this intuition that even though I'm more or less not paying
anything for a great cardinality estimate, it's kind of a shame that I
still throw it away after the sort, each and every time. I have a hard
time putting my finger on how it could be put to further use, though.
And besides, this only helps if you happen to need to do a sort (or
something else that requires a sequential scan, since the cost certainly
isn't anywhere near "free" when you didn't need to do that anyway). Our
current lack of block-based sampling probably leaves us almost as badly
off as if we *did* do a sequential scan. Not that I'm suggesting we give
up on the idea of sampling (which would be crazy).

Streaming algorithms like HyperLogLog are very recent ideas, as these
things go. I wouldn't be all that discouraged by the fact that it might
not have been put to use in this way (for database statistics) by
anybody before now.

--
Regards,
Peter Geoghegan
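
To make the "fixed overhead and constant memory usage" point concrete,
here is a toy sketch of the textbook HyperLogLog estimator in Python.
It is not the C code from the abbreviated key patch, and the particular
choices here (a HyperLogLog class, precision p=14, truncated SHA-1
standing in for a 64-bit hash) are illustrative assumptions only. The
point is that the state is 2**p small registers no matter how many
values stream through, and the relative error of the final estimate is
roughly 1.04/sqrt(2**p).

import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: 2**p registers, one pass over the stream,
    memory use independent of the number of distinct values seen."""

    def __init__(self, p=14):
        self.p = p                      # precision; 2**14 = 16384 registers
        self.m = 1 << p
        self.registers = [0] * self.m   # fixed-size state

    def add(self, value):
        # 64-bit hash of the value (truncated SHA-1, purely for illustration).
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Raw estimator: alpha_m * m**2 / sum(2**-register).
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(1_000_000):
        hll.add(f"key-{i % 50000}")     # stream 1M values, only 50,000 distinct
    print(int(hll.estimate()))          # roughly 50000; stderr ~1.04/sqrt(m), about 0.8%

A real implementation would use a fixed byte array for the registers and
a fast 64-bit hash rather than truncated SHA-1, but the shape is the
same: one pass over the stream, constant memory, and an estimate with a
predictable error bound that only becomes available once the whole
stream has been seen.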