On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@xxxxxxxxxxxxxxx> wrote:
>> Having read papers on it, I believe the problem is intractable. Coding
>> is not the issue. To anyone: please prove me wrong, in detail, with
>> references so it can be coded.
>
> I think it might be close to intractable if you're determined to use a
> sampling model. HyperLogLog looks very interesting for n_distinct
> estimation, though. My abbreviated key patch estimates the cardinality
> of abbreviated keys (and original strings that are to be sorted) with
> high precision and fixed overhead. Maybe we can figure out a way to
> do opportunistic streaming of HLL. Believe it or not, the way I use
> HLL for estimating cardinality is virtually free. Hashing is really
> cheap when the CPU is bottlenecked on memory bandwidth.

Yes, it's only intractable if you're wedded to the idea of a tiny,
fixed-size sample. If we're allowed to sample, say, 1% of the table, we
can get a MUCH more accurate n_distinct estimate using multiple
algorithms, of which HLL is one. While n_distinct will still have some
variance, it'll be over a much smaller range.

The n_distinct algo we use in Postgres is specifically designed (by its
author) to choose the smallest reasonable number of distinct values
capable of producing the observed distribution. This made sense when we
added it, because we didn't have query plans where underestimating
n_distinct produced a penalty. Now we do, and we ought to change algos.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
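
For reference, the sample-based estimator currently used by ANALYZE is
the Haas-Stokes "Duj1" estimator,

    D = n*d / (n - f1 + f1*n/N)

where n is the number of sampled rows, N the total row count, d the
number of distinct values seen in the sample, and f1 the number of
values seen exactly once in the sample. It tends to underestimate on
skewed data at small sample fractions, which is the behavior described
above.

To make the "hashing is virtually free" point concrete, here is a toy
HyperLogLog counter in C. It is a minimal sketch, not PostgreSQL's
implementation: the register count (4096), the FNV-1a hash, and the
bias constant are illustrative choices, and production code would use
a stronger hash and fuller bias corrections.

/*
 * Toy HyperLogLog, for illustration only -- not PostgreSQL's
 * implementation. 2^12 registers gives roughly 1.6% standard error.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define HLL_BITS      12
#define HLL_REGISTERS (1 << HLL_BITS)   /* 4096 */

typedef struct { uint8_t reg[HLL_REGISTERS]; } hll;

/* FNV-1a, a stand-in 64-bit hash; any well-mixed hash will do */
static uint64_t
hash64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 14695981039346656037ULL;

    while (len--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static void
hll_add(hll *h, const void *data, size_t len)
{
    uint64_t x = hash64(data, len);
    uint32_t idx = x >> (64 - HLL_BITS);    /* top bits pick a register */
    uint64_t rest = x << HLL_BITS;          /* remaining bits */
    uint8_t rank = 1;                       /* 1-based position of first 1-bit */

    while (rank <= 64 - HLL_BITS && !(rest & (1ULL << 63)))
    {
        rest <<= 1;
        rank++;
    }
    if (rank > h->reg[idx])
        h->reg[idx] = rank;                 /* keep max rank per register */
}

static double
hll_estimate(const hll *h)
{
    double  m = HLL_REGISTERS;
    double  sum = 0.0;
    int     zeros = 0;
    int     i;

    for (i = 0; i < HLL_REGISTERS; i++)
    {
        sum += ldexp(1.0, -h->reg[i]);      /* 2^(-reg[i]) */
        if (h->reg[i] == 0)
            zeros++;
    }

    double alpha = 0.7213 / (1.0 + 1.079 / m);  /* bias constant, m >= 128 */
    double e = alpha * m * m / sum;

    /* small-range correction: fall back to linear counting */
    if (e <= 2.5 * m && zeros > 0)
        e = m * log(m / zeros);
    return e;
}

int
main(void)
{
    hll     h;
    char    buf[32];
    int     i;

    memset(&h, 0, sizeof(h));
    for (i = 0; i < 100000; i++)            /* 100k rows, 20k distinct */
    {
        snprintf(buf, sizeof(buf), "%d", i % 20000);
        hll_add(&h, buf, strlen(buf));
    }
    printf("estimated n_distinct: %.0f (true: 20000)\n", hll_estimate(&h));
    return 0;
}

Compile with "cc -O2 hll.c -lm"; the estimate lands within a couple of
percent of the true 20000 (standard error for 4096 registers is about
1.04/sqrt(4096), i.e. ~1.6%), using one hash and one register update
per input value.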