On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv@xxxxxxxx> wrote:
> If you know the title of the article, it's usually available elsewhere
> on the web - either at the university site, or elsewhere. I found these
> two articles about block-based sampling:
>
> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

There is a series of papers with Chaudhuri as lead author, which I agree sounds like what Josh is talking about. Note that he leads Microsoft Research's database group, and it's a pretty safe bet that anything published from there is covered by patents from here till next Tuesday (and seventeen years beyond).

I think this is all putting the cart before the horse, however. If we could fix our current sampling to use the data it reads more efficiently, that would be a good start before we try to read even more data.

We currently read, on average, just one row from each sampled block instead of using the whole block. That is what the worst case requires, when the blocks are a very biased sample (which they probably are in most databases, given the way Postgres handles updates). But we could at least give users the option to use more than one row per block when they know it's safe (such as data that hasn't been updated), or detect when it's safe (such as by reasoning carefully about how Postgres's storage mechanism would bias the data).

I looked into this, though, and ran into a problem: I think our algorithm for calculating the most-frequent-values list is bunk. The more rows I picked from each block, the more that list was biased towards values seen earlier in the table. Worse, a biased MCV list throws off the histogram as well, since we exclude the most frequent values from the histogram, so all the estimates were thrown off.

If we could fix the most-frequent-values collection so it isn't biased when it sees values in a clumpy way, then I think it would be okay to set the row sample size in Vitter's algorithm to a factor of N larger than the block sample size, where N is somewhat smaller than the average number of rows per block. In fact, even if we used all the rows in each block, I think I've convinced myself that the results would be accurate in most circumstances.

I think calculating the most frequent values more accurately would take a two-pass approach: scan some random sample of blocks with a counting bloom filter, then do a second pass (possibly over the same sample?) keeping exact counts only for values that the counting bloom filter said hashed to the most common hash values. That might not be exactly the set of most common values, but it should at least be a representative sample of the most common values.

--
greg
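
To make the clumpiness concern concrete, here is a toy simulation (hypothetical Python, nothing like the actual analyze.c code, and it illustrates the variance from clustering rather than the specific "values seen earlier" skew described above): every row in a block shares one value, so a plain row sample pins each value's frequency near the true 2%, while keeping all 100 rows from a handful of sampled blocks makes whichever values those blocks contain look several times more common than they are.

import random
from collections import Counter

# Toy table: 10,000 blocks of 100 rows; every row in a block has the same
# value (blk % 50), so there are 50 distinct values, each with true
# frequency 1/50 = 2%.
ROWS_PER_BLOCK = 100
N_BLOCKS = 10_000
table = [[blk % 50] * ROWS_PER_BLOCK for blk in range(N_BLOCKS)]

SAMPLE_ROWS = 3_000

def row_sample(table):
    # Sample individual rows uniformly from the whole table.
    rows = [v for block in table for v in block]
    return random.sample(rows, SAMPLE_ROWS)

def block_sample(table, rows_per_block):
    # Sample whole blocks, then take rows_per_block rows from each.
    blocks = random.sample(table, SAMPLE_ROWS // rows_per_block)
    return [v for block in blocks for v in random.sample(block, rows_per_block)]

def top_freqs(sample, k=5):
    n = len(sample)
    return [(v, round(c / n, 3)) for v, c in Counter(sample).most_common(k)]

random.seed(0)
print("true frequency of every value : 0.02")
print("row sample                    :", top_freqs(row_sample(table)))
print("1 row per sampled block       :", top_freqs(block_sample(table, 1)))
print("all 100 rows per sampled block:", top_freqs(block_sample(table, ROWS_PER_BLOCK)))

In a typical run the first two samples land near 0.02 per value, while the whole-block sample only ever sees the values held by its 30 blocks and reports its top values at several times their true frequency.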
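
The two-pass idea in the last paragraph could look roughly like this (again hypothetical Python, just a sketch of the shape; CountingBloomFilter, two_pass_mcv, and the threshold heuristic are made up for illustration, not anything that exists in Postgres):

import hashlib
from collections import Counter

class CountingBloomFilter:
    # Minimal counting bloom filter: d hash functions over m counters.
    # The minimum counter across a value's slots is an upper bound on its count.
    def __init__(self, m=8192, d=3):
        self.m, self.d = m, d
        self.counters = [0] * m

    def _slots(self, value):
        for i in range(self.d):
            h = hashlib.blake2b(repr((i, value)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value):
        for s in self._slots(value):
            self.counters[s] += 1

    def estimate(self, value):
        return min(self.counters[s] for s in self._slots(value))

def two_pass_mcv(read_sample, num_mcv=10, slack=4):
    # Pass 1: stream the sampled rows into the counting bloom filter.
    cbf = CountingBloomFilter()
    n = 0
    for value in read_sample():
        cbf.add(value)
        n += 1

    # Anything that could plausibly be in the top num_mcv clears this bar;
    # slack is a fudge factor for hash-collision overcounting.
    threshold = max(2, n // (num_mcv * slack))

    # Pass 2: re-read the same sample, keeping exact counts only for values
    # whose first-pass estimate clears the threshold.
    exact = Counter()
    for value in read_sample():
        if cbf.estimate(value) >= threshold:
            exact[value] += 1
    return exact.most_common(num_mcv)

# Example with an in-memory "sample" standing in for the rows ANALYZE reads:
sample = ["a"] * 500 + ["b"] * 300 + [str(i) for i in range(2200)]
print(two_pass_mcv(lambda: iter(sample), num_mcv=5))

Because the minimum counter is always at least a value's true count, a genuinely common value can never be filtered out in the second pass; hash collisions only cost a few extra exact counters, not accuracy.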