On 17.10.2014 19:25, Greg Stark wrote:
> On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv@xxxxxxxx> wrote:
>> If you know the title of the article, it's usually available
>> elsewhere on the web - either at the university site, or elsewhere.
>> I found these two articles about block-based sampling:
>>
>> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf
>>
>
> There are a series of papers with Chaudhuri as lead author which I
> agree sounds like what Josh is talking about. Note that he's
> Microsoft Research's database group lead and it would be a pretty
> safe bet anything published from there is going to be covered by
> patents from here till next Tuesday (and seventeen years beyond).

Hmmm. I have 0 experience with handling patents and related issues.
Any idea how to address that?

> I think this is all putting the cart before the horse however. If we
> could fix our current sampling to use the data more efficiently that
> would be a good start before we start trying to read even more data.
>
> We currently read just one row from each block on average instead of
> using the whole block. That's what would be needed in the worst case
> if the blocks were a very biased sample (which indeed they probably
> are in most databases due to the way Postgres handles updates). But
> we could at least give users the option to use more than one row per
> block when they know it's ok (such as data that hasn't been updated)
> or detect when it's ok (such as by carefully thinking about how
> Postgres's storage mechanism would bias the data).

I think this will be very tricky, and in fact it may easily make the
estimates much worse, because all the algorithms assume random sampling.

For example, the ndistinct estimator relies on the low-frequency values
(those observed only once or twice in the sample). By using multiple rows
from each block you significantly skew that probability for columns whose
values are correlated with the block, which is quite common. Take for
example fact tables in data warehouses - those are usually denormalized
and mostly append-only. Say each row has a "date_id", a sequential day
number with 0 sometime in the past. Daily increments are usually stored
on many consecutive blocks, so on each block there's usually a single
date_id value. By sampling all the rows on a block you gain exactly
nothing - in fact you end up observing no low-frequency values at all,
making the estimator absolutely useless (see the simulation in the PS at
the end of this message).

I can imagine fixing this (although I don't see how exactly), but the
thing is we need to fix *all* the estimators we have, not just ndistinct.
And that's going to be tough.

I don't think adding a knob to tune the number of tuples sampled per
block is a good approach. Either we can solve the issues I described
(and in that case the knob is unnecessary), or we can't solve them and
it turns into a massive foot gun.

> But I looked into this and ran into a problem. I think our algorithm
> for calculating the most frequent values list is bunk. The more rows
> I picked from each block the more biased that list was towards
> values seen earlier in the table. What's worse, when that list was
> biased it threw off the histogram since we exclude the most frequent
> values from the histogram, so all estimates were thrown off.

I think the 'minimal' stats (used when we only have '=' for the type)
behave like this, but fixing that by switching to a two-pass approach
should not be that difficult (it would just cost a few more CPU cycles).
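Something along the lines of this toy sketch, perhaps (Python, just to
illustrate the idea - it's not what the analyze.c code actually does,
the one-pass variant is deliberately simplified and the names are mine):

from collections import Counter

TRACK_SIZE = 3   # how many distinct values the one-pass variant can track
NUM_MCV = 3      # how many "most common values" we want

def one_pass_mcv(sample):
    # Track only the first TRACK_SIZE distinct values seen; anything that
    # first appears after the tracker is full is ignored, no matter how
    # common it is later on - the order-dependent bias described above.
    track = {}
    for v in sample:
        if v in track:
            track[v] += 1
        elif len(track) < TRACK_SIZE:
            track[v] = 1
    return sorted(track.items(), key=lambda kv: -kv[1])[:NUM_MCV]

def two_pass_mcv(sample):
    # Pass 1: exact counts over the whole sample.
    # Pass 2: pick the top NUM_MCV by frequency - sample order is irrelevant.
    counts = Counter(sample)
    return counts.most_common(NUM_MCV)

# 'b' is by far the most common value, but only shows up late in the sample
sample = ['a', 'c', 'd'] * 2 + ['b'] * 10

print(one_pass_mcv(sample))   # [('a', 2), ('c', 2), ('d', 2)] - no 'b' at all
print(two_pass_mcv(sample))   # [('b', 10), ('a', 2), ('c', 2)]

The point is simply that counting over the whole sample first makes the
result independent of the order in which the values show up.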
Or do you suggest that even the scalar MCV algorithm has this bias issue?
I doubt that, because it works with an array sorted by number of
occurrences, so the position within the table is irrelevant.

> If we could fix the most frequent values collection to not be biased
> when it sees values in a clumpy way then I think we would be okay to
> set the row sample size in Vitter's algorithm to a factor of N
> larger than the block sample size where N is somewhat smaller than
> the average number of rows per block. In fact even if we used all the
> rows in the block I think I've convinced myself that the results
> would be accurate in most circumstances.

I don't expect fixing the MCV collection to be overly difficult (although
it will need a few more CPU cycles). But making it work with block
sampling will be much harder, because of the bias. The phrase 'in most
circumstances' doesn't sound very convincing to me ...

> I think to calculate the most frequent values more accurately it
> would take a two pass approach. Scan some random sample of blocks
> with a counting bloom filter then do a second pass (possibly for the
> same sample?) keeping counts only for values that the counting bloom
> filter said hashed to the most common hash values. That might not be
> exactly the most common values but should be at least a
> representative sample of the most common values.

I don't see why the counting bloom filter would be necessary in a
two-pass approach?

regards
Tomas
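PS: To make the date_id example a bit more concrete, here's a quick
Python simulation (purely illustrative - this is not what ANALYZE does
internally, the table layout is made up, and the Duj1-style formula is
written from memory, so treat it as approximate):

import random

ROWS_PER_BLOCK = 100
NUM_BLOCKS = 10000            # true ndistinct of date_id is 10000
SAMPLE_ROWS = 30000           # sample size (3% of the table)

# Each block holds ROWS_PER_BLOCK rows that all share one date_id value.
date_ids = [b for b in range(NUM_BLOCKS) for _ in range(ROWS_PER_BLOCK)]
N = len(date_ids)

def f1(values):
    # Number of values seen exactly once in the sample - this is what
    # duplicate-based ndistinct estimators feed on.
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sum(1 for c in counts.values() if c == 1)

def ndistinct_estimate(n, N, d, f1_):
    # Haas/Stokes-style (Duj1) estimator, quoted from memory - the exact
    # formula is approximate, it's the shape that matters here.
    return n * d / (n - f1_ + f1_ * n / N)

# (a) plain random row sample
vals = random.sample(date_ids, SAMPLE_ROWS)
d = len(set(vals))
print("row sample:   f1 =", f1(vals),
      " estimate =", round(ndistinct_estimate(SAMPLE_ROWS, N, d, f1(vals))))

# (b) block sample: pick whole blocks, keep every row on them
blocks = random.sample(range(NUM_BLOCKS), SAMPLE_ROWS // ROWS_PER_BLOCK)
vals = [b for b in blocks for _ in range(ROWS_PER_BLOCK)]
d = len(set(vals))
print("block sample: f1 =", f1(vals),
      " estimate =", round(ndistinct_estimate(SAMPLE_ROWS, N, d, f1(vals))))

The row sample gives f1 around 1500 and an estimate close to the true
10000; the block sample gives f1 = 0, so the estimate collapses to the
~300 distinct values actually seen in the sample.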