Re: Yet another abort-early plan disaster on 9.3

Peter Geoghegan <peter.geoghegan86@xxxxxxxxx> · Thu, 2 Oct 2014 02:30:02 -0700

On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@xxxxxxxxxxxxxxx> wrote:
>> I disagree that (1) is not worth fixing just because we've provided
>> users with an API to override the stats.  It would unquestionably be
>> better for us to have a better n_distinct estimate in the first place.
>> Further, this is an easier problem to solve, and fixing n_distinct
>> estimates would fix a large minority of currently pathological queries.
>>  It's like saying "hey, we don't need to fix the leak in your radiator,
>> we've given you a funnel in the dashboard you can pour water into."
>
> Having read papers on it, I believe the problem is intractable. Coding
> is not the issue. To anyone: please prove me wrong, in detail, with
> references so it can be coded.

I think it might be close to intractable if you're determined to use a
sampling model. HyperLogLog looks very interesting for n_distinct
estimation, though. My abbreviated key patch estimates the cardinality
of abbreviated keys (and original strings that are to be sorted) with
high precision and fixed overhead.  Maybe we can figure out a way to
do opportunistic streaming of HLL. Believe it or not, the way I use
HLL for estimating cardinality is virtually free. Hashing is really
cheap when the CPU is bottlenecked on memory bandwidth.

If you're interested, download the patch, and enable the debug traces.
You'll see HyperLogLog accurately indicate the cardinality of text
datums as they're copied into local memory before sorting.

-- 
Regards,
Peter Geoghegan

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance