On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@xxxxxxxxxxxxxxx> wrote:
>> Having read papers on it, I believe the problem is intractable. Coding
>> is not the issue. To anyone: please prove me wrong, in detail, with
>> references so it can be coded.
>
> I think it might be close to intractable if you're determined to use a
> sampling model. HyperLogLog looks very interesting for n_distinct
> estimation, though. My abbreviated key patch estimates the cardinality
> of abbreviated keys (and original strings that are to be sorted) with
> high precision and fixed overhead. Maybe we can figure out a way to
> do opportunistic streaming of HLL. Believe it or not, the way I use
> HLL for estimating cardinality is virtually free. Hashing is really
> cheap when the CPU is bottlenecked on memory bandwidth.

Yes, it's only intractable if you're wedded to the idea of a tiny,
fixed-size sample. If we're allowed to sample, say, 1% of the table, we
can get a MUCH more accurate n_distinct estimate using multiple
algorithms, of which HLL is one. While n_distinct will still have some
variance, it'll be over a much smaller range.

The n_distinct algo we use in Postgres is specifically designed (by its
author) to choose the smallest reasonable number of distinct values
capable of producing the observed distribution. This made sense when we
added it, because we didn't have query plans where underestimating
n_distinct produced a penalty. Now we do, and we ought to change algos.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
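
For reference, the sample-based estimator currently used by ANALYZE is
the Haas-Stokes "Duj1" estimator,

    D = n*d / (n - f1 + f1*n/N)

where n is the number of sampled rows, N the total row count, d the
number of distinct values seen in the sample, and f1 the number of
values seen exactly once in the sample. It tends to underestimate on
skewed data at small sample fractions, which is the behavior described
above.

To make the "hashing is virtually free" point concrete, here is a toy
HyperLogLog counter in C. It is a minimal sketch, not PostgreSQL's
implementation: the register count (4096), the FNV-1a hash, and the
bias constant are illustrative choices, and production code would use
a stronger hash and fuller bias corrections.

/*
 * Toy HyperLogLog, for illustration only -- not PostgreSQL's
 * implementation. 2^12 registers gives roughly 1.6% standard error.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define HLL_BITS      12
#define HLL_REGISTERS (1 << HLL_BITS)   /* 4096 */

typedef struct { uint8_t reg[HLL_REGISTERS]; } hll;

/* FNV-1a, a stand-in 64-bit hash; any well-mixed hash will do */
static uint64_t
hash64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;

    while (len--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static void
hll_add(hll *h, const void *data, size_t len)
{
    uint64_t x = hash64(data, len);
    uint32_t idx = x >> (64 - HLL_BITS);    /* top bits pick a register */
    uint64_t rest = x << HLL_BITS;          /* remaining bits */
    uint8_t rank = 1;                       /* 1-based position of first 1-bit */

    while (rank <= 64 - HLL_BITS && !(rest & (1ULL << 63)))
    {
        rest <<= 1;
        rank++;
    }
    if (rank > h->reg[idx])
        h->reg[idx] = rank;                 /* keep max rank per register */
}

static double
hll_estimate(const hll *h)
{
    double  m = HLL_REGISTERS;
    double  sum = 0.0;
    int     zeros = 0;
    int     i;

    for (i = 0; i < HLL_REGISTERS; i++)
    {
        sum += ldexp(1.0, -h->reg[i]);      /* 2^(-reg[i]) */
        if (h->reg[i] == 0)
            zeros++;
    }

    double alpha = 0.7213 / (1.0 + 1.079 / m);  /* bias constant, m >= 128 */
    double e = alpha * m * m / sum;

    /* small-range correction: fall back to linear counting */
    if (e <= 2.5 * m && zeros > 0)
        e = m * log(m / zeros);
    return e;
}

int
main(void)
{
    hll     h;
    char    buf[32];
    int     i;

    memset(&h, 0, sizeof(h));
    for (i = 0; i < 100000; i++)            /* 100k rows, 20k distinct */
    {
        snprintf(buf, sizeof(buf), "%d", i % 20000);
        hll_add(&h, buf, strlen(buf));
    }
    printf("estimated n_distinct: %.0f (true: 20000)\n", hll_estimate(&h));
    return 0;
}

Compile with "cc -O2 hll.c -lm"; the estimate lands within a couple of
percent of the true 20000 (standard error for 4096 registers is about
1.04/sqrt(4096), i.e. ~1.6%), using one hash and one register update
per input value.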