On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@xxxxxxxxxxxxxxx> wrote:
>> I disagree that (1) is not worth fixing just because we've provided
>> users with an API to override the stats. It would unquestionably be
>> better for us to have a better n_distinct estimate in the first place.
>> Further, this is an easier problem to solve, and fixing n_distinct
>> estimates would fix a large minority of currently pathological queries.
>> It's like saying "hey, we don't need to fix the leak in your radiator,
>> we've given you a funnel in the dashboard you can pour water into."
>
> Having read papers on it, I believe the problem is intractable. Coding
> is not the issue. To anyone: please prove me wrong, in detail, with
> references so it can be coded.

I think it might be close to intractable if you're determined to use a
sampling model. HyperLogLog looks very interesting for n_distinct
estimation, though. My abbreviated key patch estimates the cardinality
of abbreviated keys (and original strings that are to be sorted) with
high precision and fixed overhead. Maybe we can figure out a way to do
opportunistic streaming of HLL. Believe it or not, the way I use HLL
for estimating cardinality is virtually free. Hashing is really cheap
when the CPU is bottlenecked on memory bandwidth.

If you're interested, download the patch, and enable the debug traces.
You'll see HyperLogLog accurately indicate the cardinality of text
datums as they're copied into local memory before sorting.

--
Regards,
Peter Geoghegan

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
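
For anyone who hasn't looked at HyperLogLog before, here is a rough sketch of the idea in Python. To be clear, this is not the code from the abbreviated keys patch, just a textbook-style illustration. The class name, the choice of 2^12 registers, and the use of BLAKE2b as the 64-bit hash are all assumptions made for the example. The point it tries to show is the "fixed overhead" property: each value costs one hash and one register update, and memory stays constant no matter how many values are streamed through.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical; not the patch's code)."""

    def __init__(self, p=12):
        self.p = p                       # number of hash bits used as register index
        self.m = 1 << p                  # number of registers (fixed memory)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)

    def add(self, value: bytes) -> None:
        # Assumption: any well-mixed 64-bit hash works; BLAKE2b is used here
        # only because it ships with the standard library.
        h = int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "big")
        width = 64 - self.p
        idx = h >> width                 # top p bits select a register
        rest = h & ((1 << width) - 1)    # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits
        # (width + 1 if they are all zero).
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        # Raw estimate from the harmonic mean of 2^register.
        z = sum(2.0 ** -r for r in self.registers)
        e = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return e
```

With 4096 registers the expected relative error is roughly 1.04/sqrt(4096), or about 1.6%. Feeding the same value in again never changes a register, so duplicates are free, which is what makes this attractive for counting distinct values in a stream.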