On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.

I think that HyperLogLog, as a streaming algorithm, will always require
that the entire set be streamed. That doesn't need to be a big deal: in
the case of my abbreviated key patch, it appears to be basically free,
because we were streaming everything anyway. It's a very cool algorithm,
with fixed overhead and constant memory usage, and it makes very useful
guarantees about accuracy (see the toy sketch appended below).

I have this intuition that even though I'm more or less not paying
anything for a great cardinality estimate, it's kind of a shame that I
still throw it away after the sort, each and every time. I have a hard
time putting my finger on how it could be put to further use, though.
And besides, this only helps if you happen to need to do a sort (or
something else that requires a sequential scan, since the cost certainly
isn't anywhere near "free" when you didn't need to do that anyway). Our
current lack of block-based sampling probably leaves us almost as badly
off as if we *did* do a sequential scan. Not that I'm suggesting we give
up on the idea of sampling (which would be crazy).

Streaming algorithms like HyperLogLog are very recent ideas, as these
things go. I wouldn't be all that discouraged by the fact that it might
not have been put to use in this way (for database statistics) by
anybody before now.

--
Regards,
Peter Geoghegan
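
To make the "fixed overhead and constant memory usage" point concrete,
here is a toy sketch of the textbook HyperLogLog estimator in Python.
It is not the C code from the abbreviated key patch, and the particular
choices here (a HyperLogLog class, precision p=14, truncated SHA-1
standing in for a 64-bit hash) are illustrative assumptions only. The
point is that the state is 2**p small registers no matter how many
values stream through, and the relative error of the final estimate is
roughly 1.04/sqrt(2**p).

import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: 2**p registers, one pass over the stream,
    memory use independent of the number of distinct values seen."""

    def __init__(self, p=14):
        self.p = p                      # precision; 2**14 = 16384 registers
        self.m = 1 << p
        self.registers = [0] * self.m   # fixed-size state

    def add(self, value):
        # 64-bit hash of the value (truncated SHA-1, purely for illustration).
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Raw estimator: alpha_m * m**2 / sum(2**-register).
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(1_000_000):
        hll.add(f"key-{i % 50000}")     # stream 1M values, only 50,000 distinct
    print(int(hll.estimate()))          # roughly 50000; stderr ~1.04/sqrt(m), about 0.8%

A real implementation would use a fixed byte array for the registers and
a fast 64-bit hash rather than truncated SHA-1, but the shape is the
same: one pass over the stream, constant memory, and an estimate with a
predictable error bound that only becomes available once the whole
stream has been seen.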