On Thu, 26 Sep 2013, Mark Nelson wrote:
> On 09/25/2013 07:34 PM, Sage Weil wrote:
> > I spent some time on the plane playing with bloom filters.  We're looking
> > at using these on the OSD to (more) efficiently keep track of which
> > objects have (probably) been touched in recent history.  This can then be
> > used by the caching and tiering code to identify objects that have
> > (definitely) not been touched that should be demoted or purged from a
> > given tier.
>
> You know this is making me want to work on my photon mapping engine again
> instead of running benchmarks for Bryan. :P  Anyway, this seems like a great
> use case.  Are there any other places where we want fast intersection tests
> that we could use these for?  k-nearest-neighbor searches would be tons of
> fun to do too if we only had a use case!
>
> > The false positive probability (in our case, the probability we'll think
> > an object has been touched when it actually hasn't) is dependent on the
> > size of the bloom filter and the number of insertions.
> >
> > When you create the bloom_filter, you tell it how many insertions you
> > expect to do and what fpp you want, and it sizes itself.  For ~1% fpp, it
> > costs just over 1 byte per insertion.
>
> It's been a long time since I looked at bloom filters, but isn't this also
> affected by the number of hash functions used?  Are we going to try to have
> it auto-select the number?

The bloom_filter class hides this.  You feed it the desired fpp and
insertion count and it chooses the number of hashes (actually, 1 hash but
N salts) and the table size.  Unit tests verify that the actual fpp very
closely matches the target.

> > wip-bloom has some cleanups and unit tests that test the false positive
> > behavior.
> >
> > A few issues/notes:
> >
> > - If we build a sequence of filters covering intervals of time, this lets
> >   us estimate "atime".  However, if we have N filters of 1/Nth the size,
> >   our total_fpp = N * fpp, so each bin's bloom needs to be 1/Nth as big
> >   with 1/Nth the desired fpp.  In my simple test, going from 1 to 16 bins
> >   moves the bloom_filter size from 20 KB to 32 KB (that's 16k insertions
> >   and a target_fpp of .01).
> >
> > - In the OSD, we won't necessarily know how many insertions a bin will
> >   have.  The user inputs will/should be something like:
> >     - target fpp
> >     - time interval we want to record/estimate over (say, 1 day)
> >   and we'll need to translate that into a particular sized bloom.  We
> >   probably need to estimate that based on recent pg workload/activity.
> >   If we have a filter that "fills up" quickly, the next one will be
> >   bigger.  If we have an unfull one after some period of time, we can
> >   compress/fold it down into a smaller filter (a convenient trick) and
> >   start the next, smaller one.
>
> So basically we try to keep a running log of the PG activity so we can
> guess how big future filters should be, and if our guess is short we move
> on and make the next one bigger, but if we overshot we compress it into a
> smaller filter and make the next one smaller?  I think this is exactly
> what you said but I want to make sure I got it. :)

Exactly.
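For anyone who wants to sanity-check those sizes, the back-of-the-envelope
math looks roughly like the sketch below.  It uses the textbook bloom filter
sizing formulas; the actual bloom_filter class picks its parameters with its
own search, so treat the exact numbers as approximate rather than what the
class will really produce.

// Back-of-the-envelope bloom filter sizing using the standard formulas:
//   bits per insertion  m/n = -ln(p) / (ln 2)^2
//   number of hashes    k   = (m/n) * ln 2
// Sketch only; Ceph's bloom_filter sizes itself a bit differently.
#include <cmath>
#include <cstdio>

static double bloom_bytes(double insertions, double fpp) {
  double bits_per_key = -std::log(fpp) / (std::log(2.0) * std::log(2.0));
  return insertions * bits_per_key / 8.0;
}

int main() {
  const double insertions = 16384;   // 16k insertions, as in the test above
  const double target_fpp = 0.01;

  // One big filter covering the whole interval: ~1.2 bytes/insertion, ~19 KB.
  printf("1 bin:   %.1f KB\n", bloom_bytes(insertions, target_fpp) / 1024.0);

  // N smaller filters: a lookup probes all N, so total_fpp ~= N * fpp and
  // each bin has to be built for fpp/N.  That is why 16 bins costs ~31 KB
  // instead of ~19 KB for the same total number of insertions.
  const int bins = 16;
  double per_bin = bloom_bytes(insertions / bins, target_fpp / bins);
  printf("%d bins: %.1f KB\n", bins, per_bin * bins / 1024.0);
  return 0;
}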
> > I suggest the pg_pool_t get:
> >
> >   uint8_t  bloom_fpp_pow;  //< desired false positive rate == 2^(-bloom_fpp_pow)
> >   uint16_t bloom_period;   //< period for that false positive rate (seconds)
> >   uint16_t bloom_bins;     //< the approx "resolution" we want
> >   uint16_t bloom_max;      //< how many periods we retain history for
> >
> > which would make the OSD try to create a new filter roughly every
> > (period/bins) seconds with some simple estimation.
> >
> > Other thoughts on the OSD side stuff:
> >
> > - The past history of filters may be large.  I suggest we stick them in
> >   an omap object that lives with the PG and is recovered/backfilled, but
> >   that needs to happen in a way that is out of band wrt other objects.
> >   It should get scrubbed.
>
> Any guesses as to how large this will get yet?  It's too early in the
> morning for me to do math...

Greg did a quick calc yesterday, I forget.  It was enough that it might
fit in RAM for a spinning disk for 1-2 days of history, but only
marginally.

sage

> > - Before anything is trimmed from the pg log it needs to be added to the
> >   currently-accumulating bloom filter.  We can delay this until (well)
> >   after last_complete_on_disk so that we never have to worry about
> >   divergent entries.
> >
> > sage
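To make those knobs concrete, here is a small sketch of the proposed
pg_pool_t fields and the values the OSD would derive from them.  The field
names are just the ones proposed above, nothing here is committed code, and
the example parameter values are made up for illustration.

// Sketch of the proposed pg_pool_t additions and the derived values.
// Names mirror the proposal in the thread; illustrative only.
#include <cmath>
#include <cstdint>
#include <cstdio>

struct bloom_pool_params {
  uint8_t  bloom_fpp_pow;   //< desired false positive rate == 2^(-bloom_fpp_pow)
  uint16_t bloom_period;    //< period for that false positive rate (seconds)
  uint16_t bloom_bins;      //< the approx "resolution" we want
  uint16_t bloom_max;       //< how many periods we retain history for

  double target_fpp() const { return std::ldexp(1.0, -bloom_fpp_pow); }

  // Each bin gets 1/bins of the false positive budget, since a lookup has
  // to probe every bin in the period.
  double per_bin_fpp() const { return target_fpp() / bloom_bins; }

  // The OSD would try to start a new filter roughly this often.
  double seconds_per_filter() const {
    return double(bloom_period) / bloom_bins;
  }
};

int main() {
  // Example: 2^-7 (~0.8%) fpp over a one-hour period, 16 bins, 24 periods.
  bloom_pool_params p{7, 3600, 16, 24};
  printf("target fpp        %.4f\n", p.target_fpp());
  printf("per-bin fpp       %.5f\n", p.per_bin_fpp());
  printf("new filter every ~%.0f seconds\n", p.seconds_per_filter());
  return 0;
}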