On Thu, Sep 26, 2013 at 8:52 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Thu, 26 Sep 2013, Mark Nelson wrote:
>> On 09/25/2013 07:34 PM, Sage Weil wrote:
>> > I spent some time on the plane playing with bloom filters. We're
>> > looking at using these on the OSD to (more) efficiently keep track
>> > of which objects have (probably) been touched in recent history.
>> > This can then be used by the caching and tiering code to identify
>> > objects that have (definitely) not been touched and that should be
>> > demoted or purged from a given tier.
>>
>> You know this is making me want to work on my photon mapping engine
>> again instead of running benchmarks for Bryan. :P Anyway, this seems
>> like a great use case. Are there any other places where we want fast
>> intersection tests and could use these? k-nearest-neighbor searches
>> would be tons of fun to do too if we only had a use case!
>>
>> > The false positive probability (in our case, the probability we'll
>> > think an object has been touched when it actually hasn't) depends on
>> > the size of the bloom filter and the number of insertions.
>> >
>> > When you create the bloom_filter, you tell it how many insertions
>> > you expect to do and what fpp you want, and it sizes itself. For
>> > ~1% fpp, it costs just over 1 byte per insertion.
>>
>> It's been a long time since I looked at bloom filters, but isn't this
>> also affected by the number of hash functions used? Are we going to
>> try to have it auto-select the number?
>
> The bloom_filter class hides this. You feed it the desired fpp and
> insertion count and it chooses the number of hashes (actually, 1 hash
> with N salts) and the table size. Unit tests verify that the actual
> fpp very closely matches the target.
>
>> > wip-bloom has some cleanups and unit tests that test the false
>> > positive behavior.
>> >
>> > A few issues/notes:
>> >
>> > - If we build a sequence of filters covering intervals of time, this
>> > lets us estimate "atime".
>> > However, if we have N filters of 1/Nth the size, our
>> > total_fpp = N * fpp, so each bin's bloom needs to cover 1/Nth the
>> > insertions with 1/Nth the desired fpp. In my simple test, going
>> > from 1 to 16 bins moves the bloom_filter size from 20 KB to 32 KB
>> > (that's 16k insertions and a target_fpp of .01).
>> >
>> > - In the OSD, we won't necessarily know how many insertions a bin
>> > will have. The user inputs will/should be something like:
>> >   - target fpp
>> >   - time interval we want to record/estimate over (say, 1 day)
>> > and we'll need to translate that into a particular sized bloom. We
>> > probably need to estimate that based on recent pg workload/activity.
>> > If we have a filter that "fills up" quickly, the next one will be
>> > bigger. If we have an unfull one after some period of time, we can
>> > compress/fold it down into a smaller filter (a convenient trick) and
>> > start the next, smaller one.
>>
>> So basically we try to keep a running log of the PG activity so we
>> can guess how big future filters should be, and if our guess is short
>> we move on and make the next one bigger, but if we overshot we
>> compress it into a smaller filter and make the next one smaller? I
>> think this is exactly what you said, but I want to make sure I got
>> it. :)
>
> Exactly.
>
>> > I suggest the pg_pool_t get:
>> >
>> >   uint8_t  bloom_fpp_pow;  //< desired false positive rate == 2^(-bloom_fpp_pow)
>> >   uint16_t bloom_period;   //< period for that false positive rate (seconds)
>> >   uint16_t bloom_bins;     //< the approx "resolution" we want
>> >   uint16_t bloom_max;      //< how many periods we retain history for
>> >
>> > which would make the OSD try to create a new filter roughly every
>> > (period/bins) seconds, with some simple estimation.
>> >
>> > Other thoughts on the OSD side stuff:
>> >
>> > - The past history of filters may be large.
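[Editor's aside: the sizing behavior quoted above, roughly 1 byte per insertion at ~1% fpp and the jump from ~20 KB to ~32 KB when splitting 16k insertions across 16 bins, follows from the standard bloom filter formulas. A quick sketch to reproduce those numbers; the helper names here are mine, not the bloom_filter class API:]

```python
import math

def bloom_bits_per_insert(fpp):
    """Optimal bits per inserted key for a target false positive rate."""
    return -math.log(fpp) / (math.log(2) ** 2)

def bloom_num_hashes(fpp):
    """Optimal number of hash functions (salts) for a target fpp."""
    return max(1, round(-math.log(fpp) / math.log(2)))

def binned_size_bytes(total_inserts, target_fpp, bins):
    """Total size in bytes of `bins` filters covering `total_inserts`.

    A query must probe every bin, so each bin gets target_fpp / bins
    for the union of the filters to stay at the target rate.
    """
    per_bin_fpp = target_fpp / bins
    per_bin_inserts = total_inserts / bins
    per_bin_bits = per_bin_inserts * bloom_bits_per_insert(per_bin_fpp)
    return bins * per_bin_bits / 8

print(bloom_bits_per_insert(0.01))         # ~9.6 bits, just over 1 byte
print(bloom_num_hashes(0.01))              # 7 salts for ~1% fpp
print(binned_size_bytes(16384, 0.01, 1))   # ~19.6 KB, the "20 KB" above
print(binned_size_bytes(16384, 0.01, 16))  # ~31.4 KB, the "32 KB" above
```

[The per-bin filters are individually smaller but cost more in total, matching Sage's point that N bins each need 1/Nth the per-bin fpp.]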
>> > I suggest we stick them in an omap object that lives with the PG
>> > and is recovered/backfilled, but that needs to happen in a way that
>> > is out of band wrt other objects. It should get scrubbed.
>>
>> Any guesses as to how large this will get yet? It's too early in the
>> morning for me to do math...
>
> Greg did a quick calc yesterday; I forget. It was enough that it
> might fit in RAM for a spinning disk for 1-2 days of history, but
> only marginally.

Well, it depends on the sort of windows and overlaps we want, but:

  120 ops/s * 3600 s/hour * 1 byte/op = ~422 KB/hour, or <10 MB/day

(where we assume each op touches a different object, which is
unlikely). So it should actually fit in RAM just fine as long as we
make it durable, but we need to handle all-SSD setups as well, and
those can get much larger.

What's more interesting here is that, for those who require it, we can
feasibly store a list of the actual objects touched for only 4 times
as much data (4-byte hashes versus ~1 byte per insert in the bloom
filter).

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
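[Editor's aside: Greg's back-of-envelope above checks out; spelled out below, with the 4x figure being the cost of keeping exact 4-byte object hashes instead of ~1 bloom byte per insert:]

```python
# Back-of-envelope from the thread: a spinning disk doing ~120 ops/s,
# with the bloom filter costing ~1 byte per insertion at ~1% fpp, and
# (pessimistically) every op touching a distinct object.
ops_per_sec = 120
bytes_per_insert = 1

per_hour = ops_per_sec * 3600 * bytes_per_insert  # filter bytes per hour
per_day = per_hour * 24

print(per_hour / 1024)      # ~422 KiB/hour
print(per_day / 1024 ** 2)  # ~9.9 MiB/day

# Keeping exact 4-byte object hashes instead of bloom bits costs 4x:
print(per_day * 4 / 1024 ** 2)  # ~39.6 MiB/day
```

[An all-SSD OSD sustaining orders of magnitude more ops/s scales these numbers proportionally, which is the caveat about SSD setups above.]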