On 15.10.2014 19:20, Josh Berkus wrote:
> On 10/10/2014 04:16 AM, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> So, right now our estimation is off on large tables by -10X to
> -10000X. First, the fact that it's *always* low is an indication
> we're using the wrong algorithm. Second, we can most certainly do
> better than a median of -1000X.

A few days ago I posted an experimental patch with the adaptive
estimator described in [1]. It's not perfect, but based on the testing
I did I believe it's a superior algorithm to the one we use now. It
would be nice to identify a few datasets where the current estimate is
way off.

[1] http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

> One interesting set of algorithms is block-based sampling. That is,
> you read 5% of the physical table in random blocks, reading every row
> in the block. The block size is determined by your storage block
> size, so you're not actually reading any more physically than you are
> logically; it really is just 5% of the table, especially on SSD.
>
> Then you apply algorithms which first estimate the correlation of
> common values in the block (i.e. how likely is it that the table is
> completely sorted?), and then estimates of how many values there
> might be total based on the correlation estimate.

I think we might also use a different approach - instead of sampling
the data when ANALYZE kicks in, we might collect a sample of the
requested size on the fly. Say we want a 1% sample - whenever a new row
is inserted, we do [random() < 0.01] and if that happens to be true we
keep a copy of the row aside. Then, when we need the sample, we simply
read it and we're done - no random access to the main table, no
problems with estimates being off due to block-level sampling, etc.

I'm not sure how to track deletions/updates, though. Maybe we could
rebuild the sample once the number of deletions exceeds some threshold,
but that contradicts the whole idea a bit. (There's a rough sketch of
the bookkeeping at the end of this mail.)

> I no longer have my ACM membership, so I can't link this, but
> researchers were able to get +/- 3X accuracy for a TPCH workload
> using this approach. A real database would be more variable, of
> course, but even so we should be able to achieve +/- 50X, which
> would be an order of magnitude better than we're doing now.

If you know the title of the article, it's usually possible to find it
elsewhere on the web - often at the authors' university sites. I found
these two papers about block-based sampling:

http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf
https://www.stat.washington.edu/research/reports/1999/tr355.pdf

Maybe there are more, but most of the other links I found were about
how Oracle does this in 11g.
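
Just to make the on-the-fly sampling idea above a bit more concrete,
here is a rough Python sketch of the bookkeeping I have in mind. It's
only an illustration, not backend code - the 1% fraction comes from the
example above, while the 10% rebuild threshold and all the names are
made up:

    import random

    # Rough sketch of keeping a ~1% sample of rows on the fly.
    # Purely illustrative - the 10% rebuild threshold and all the
    # names are made up for the example.
    class RowSample:
        def __init__(self, fraction=0.01, rebuild_threshold=0.10):
            self.fraction = fraction            # e.g. a 1% sample
            self.rebuild_threshold = rebuild_threshold
            self.sample = {}                    # row id -> copy of the row
            self.deletions = 0                  # deletions since last rebuild

        def on_insert(self, rowid, row):
            # keep roughly `fraction` of newly inserted rows
            if random.random() < self.fraction:
                self.sample[rowid] = row

        def on_delete(self, rowid, table_rows):
            # forget the row if we happened to have sampled it
            self.sample.pop(rowid, None)
            self.deletions += 1
            # once deletions pile up, the sample goes stale - signal a rebuild
            return self.deletions > self.rebuild_threshold * table_rows

        def rebuild(self, all_rows):
            # the expensive fallback: rescan and re-sample every live row
            self.sample = {rid: row for rid, row in all_rows
                           if random.random() < self.fraction}
            self.deletions = 0

The on_delete() part is exactly the bit I'm unsure about - the rebuild
is essentially a full rescan, which is what the whole idea was supposed
to avoid in the first place.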
Tomas