On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv@xxxxxxxx> wrote:
> If you know the title of the article, it's usually available elsewhere
> on the web - either at the university site, or elsewhere. I found these
> two articles about block-based sampling:
>
> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

There is a series of papers with Chaudhuri as lead author, which I agree sounds like what Josh is talking about. Note that he leads Microsoft Research's database group, and it's a pretty safe bet that anything published from there is covered by patents from here till next Tuesday (and seventeen years beyond).

I think this is all putting the cart before the horse, however. If we could fix our current sampling to use the data it reads more efficiently, that would be a good start before we try to read even more data.

We currently read, on average, just one row from each sampled block instead of using the whole block. That is what the worst case requires, when the blocks are a very biased sample (which they probably are in most databases, given the way Postgres handles updates). But we could at least give users the option to use more than one row per block when they know it's safe (such as data that hasn't been updated), or detect when it's safe (such as by reasoning carefully about how Postgres's storage mechanism would bias the data).

I looked into this, though, and ran into a problem: I think our algorithm for calculating the most-frequent-values list is bunk. The more rows I picked from each block, the more that list was biased towards values seen earlier in the table. Worse, a biased MCV list throws off the histogram as well, since we exclude the most frequent values from the histogram, so all the estimates were thrown off.

If we could fix the most-frequent-values collection so it isn't biased when it sees values in a clumpy way, then I think it would be okay to set the row sample size in Vitter's algorithm to a factor of N larger than the block sample size, where N is somewhat smaller than the average number of rows per block. In fact, even if we used all the rows in each block, I think I've convinced myself that the results would be accurate in most circumstances.

I think calculating the most frequent values more accurately would take a two-pass approach: scan some random sample of blocks with a counting bloom filter, then do a second pass (possibly over the same sample?) keeping exact counts only for values that the counting bloom filter said hashed to the most common hash values. That might not be exactly the set of most common values, but it should at least be a representative sample of the most common values.

--
greg
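
To make the clumpiness concern concrete, here is a toy simulation (hypothetical Python, nothing like the actual analyze.c code, and it illustrates the variance from clustering rather than the specific "values seen earlier" skew described above): every row in a block shares one value, so a plain row sample pins each value's frequency near the true 2%, while keeping all 100 rows from a handful of sampled blocks makes whichever values those blocks contain look several times more common than they are.

import random
from collections import Counter

# Toy table: 10,000 blocks of 100 rows; every row in a block has the same
# value (blk % 50), so there are 50 distinct values, each with true
# frequency 1/50 = 2%.
ROWS_PER_BLOCK = 100
N_BLOCKS = 10_000
table = [[blk % 50] * ROWS_PER_BLOCK for blk in range(N_BLOCKS)]

SAMPLE_ROWS = 3_000

def row_sample(table):
    # Sample individual rows uniformly from the whole table.
    rows = [v for block in table for v in block]
    return random.sample(rows, SAMPLE_ROWS)

def block_sample(table, rows_per_block):
    # Sample whole blocks, then take rows_per_block rows from each.
    blocks = random.sample(table, SAMPLE_ROWS // rows_per_block)
    return [v for block in blocks for v in random.sample(block, rows_per_block)]

def top_freqs(sample, k=5):
    n = len(sample)
    return [(v, round(c / n, 3)) for v, c in Counter(sample).most_common(k)]

random.seed(0)
print("true frequency of every value : 0.02")
print("row sample                    :", top_freqs(row_sample(table)))
print("1 row per sampled block       :", top_freqs(block_sample(table, 1)))
print("all 100 rows per sampled block:", top_freqs(block_sample(table, ROWS_PER_BLOCK)))

In a typical run the first two samples land near 0.02 per value, while the whole-block sample only ever sees the values held by its 30 blocks and reports its top values at several times their true frequency.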
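
The two-pass idea in the last paragraph could look roughly like this (again hypothetical Python, just a sketch of the shape; CountingBloomFilter, two_pass_mcv, and the threshold heuristic are made up for illustration, not anything that exists in Postgres):

import hashlib
from collections import Counter

class CountingBloomFilter:
    # Minimal counting bloom filter: d hash functions over m counters.
    # The minimum counter across a value's slots is an upper bound on its count.
    def __init__(self, m=8192, d=3):
        self.m, self.d = m, d
        self.counters = [0] * m

    def _slots(self, value):
        for i in range(self.d):
            h = hashlib.blake2b(repr((i, value)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value):
        for s in self._slots(value):
            self.counters[s] += 1

    def estimate(self, value):
        return min(self.counters[s] for s in self._slots(value))

def two_pass_mcv(read_sample, num_mcv=10, slack=4):
    # Pass 1: stream the sampled rows into the counting bloom filter.
    cbf = CountingBloomFilter()
    n = 0
    for value in read_sample():
        cbf.add(value)
        n += 1

    # Anything that could plausibly be in the top num_mcv clears this bar;
    # slack is a fudge factor for hash-collision overcounting.
    threshold = max(2, n // (num_mcv * slack))

    # Pass 2: re-read the same sample, keeping exact counts only for values
    # whose first-pass estimate clears the threshold.
    exact = Counter()
    for value in read_sample():
        if cbf.estimate(value) >= threshold:
            exact[value] += 1
    return exact.most_common(num_mcv)

# Example with an in-memory "sample" standing in for the rows ANALYZE reads:
sample = ["a"] * 500 + ["b"] * 300 + [str(i) for i in range(2200)]
print(two_pass_mcv(lambda: iter(sample), num_mcv=5))

Because the minimum counter is always at least a value's true count, a genuinely common value can never be filtered out in the second pass; hash collisions only cost a few extra exact counters, not accuracy.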