On 10/10/2014 04:16 AM, Greg Stark wrote: > On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote: >> Yes, it's only intractable if you're wedded to the idea of a tiny, >> fixed-size sample. If we're allowed to sample, say, 1% of the table, we >> can get a MUCH more accurate n_distinct estimate using multiple >> algorithms, of which HLL is one. While n_distinct will still have some >> variance, it'll be over a much smaller range. > > I've gone looking for papers on this topic but from what I read this > isn't so. To get any noticeable improvement you need to read 10-50% of > the table and that's effectively the same as reading the entire table > -- and it still had pretty poor results. All the research I could find > went into how to analyze the whole table while using a reasonable > amount of scratch space and how to do it incrementally. So, right now our estimation is off on large tables by -10X to -10000X. First, the fact that it's *always* low is an indication we're using the wrong algorithm. Second, we can most certainly do better than a median of -1000X. One interesting set of algorithms is block-based sampling. That is, you read 5% of the physical table in random blocks, reading every row in the block. The block size is determined by your storage block size, so you're not actually reading any more physically than you are logically; it really is just 5% of the table, especially on SSD. Then you apply algorithms which first estimate the correlation of common values in the block (i.e. how likely is it that the table is completely sorted?), and then estimates of how many values there might be total based on the correlation estimate. I no longer have my ACM membership, so I can't link this, but researchers were able to get +/- 3X accuracy for a TPCH workload using this approach. A real database would be more variable, of course, but even so we should be able to achieve +/- 50X, which would be an order of magnitude better than we're doing now. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance