On Thu, Sep 26, 2013 at 8:52 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Thu, 26 Sep 2013, Mark Nelson wrote:
>> On 09/25/2013 07:34 PM, Sage Weil wrote:
>> > I spent some time on the plane playing with bloom filters. We're
>> > looking at using these on the OSD to (more) efficiently keep track
>> > of which objects have (probably) been touched in recent history.
>> > This can then be used by the caching and tiering code to identify
>> > objects that have (definitely) not been touched and that should be
>> > demoted or purged from a given tier.
>>
>> You know this is making me want to work on my photon mapping engine
>> again instead of running benchmarks for Bryan. :P Anyway, this seems
>> like a great use case. Are there any other places where we want fast
>> intersection tests and could use these? k-nearest-neighbor searches
>> would be tons of fun to do too if we only had a use case!
>>
>> > The false positive probability (in our case, the probability we'll
>> > think an object has been touched when it actually hasn't) depends on
>> > the size of the bloom filter and the number of insertions.
>> >
>> > When you create the bloom_filter, you tell it how many insertions
>> > you expect to do and what fpp you want, and it sizes itself. For
>> > ~1% fpp, it costs just over 1 byte per insertion.
>>
>> It's been a long time since I looked at bloom filters, but isn't this
>> also affected by the number of hash functions used? Are we going to
>> try to have it auto-select the number?
>
> The bloom_filter class hides this. You feed it the desired fpp and
> insertion count and it chooses the number of hashes (actually, 1 hash
> with N salts) and the table size. Unit tests verify that the actual
> fpp very closely matches the target.
>
>> > wip-bloom has some cleanups and unit tests that test the false
>> > positive behavior.
>> >
>> > A few issues/notes:
>> >
>> > - If we build a sequence of filters covering intervals of time, this
>> > lets us estimate "atime".
>> > However, if we have N filters of 1/Nth the size, our
>> > total_fpp = N * fpp, so each bin's bloom needs to cover 1/Nth the
>> > insertions with 1/Nth the desired fpp. In my simple test, going
>> > from 1 to 16 bins moves the bloom_filter size from 20 KB to 32 KB
>> > (that's 16k insertions and a target_fpp of .01).
>> >
>> > - In the OSD, we won't necessarily know how many insertions a bin
>> > will have. The user inputs will/should be something like:
>> >   - target fpp
>> >   - time interval we want to record/estimate over (say, 1 day)
>> > and we'll need to translate that into a particular sized bloom. We
>> > probably need to estimate that based on recent pg workload/activity.
>> > If we have a filter that "fills up" quickly, the next one will be
>> > bigger. If we have an unfull one after some period of time, we can
>> > compress/fold it down into a smaller filter (a convenient trick) and
>> > start the next, smaller one.
>>
>> So basically we try to keep a running log of the PG activity so we
>> can guess how big future filters should be, and if our guess is short
>> we move on and make the next one bigger, but if we overshot we
>> compress it into a smaller filter and make the next one smaller? I
>> think this is exactly what you said, but I want to make sure I got
>> it. :)
>
> Exactly.
>
>> > I suggest the pg_pool_t get:
>> >
>> >   uint8_t  bloom_fpp_pow;  //< desired false positive rate == 2^(-bloom_fpp_pow)
>> >   uint16_t bloom_period;   //< period for that false positive rate (seconds)
>> >   uint16_t bloom_bins;     //< the approx "resolution" we want
>> >   uint16_t bloom_max;      //< how many periods we retain history for
>> >
>> > which would make the OSD try to create a new filter roughly every
>> > (period/bins) seconds, with some simple estimation.
>> >
>> > Other thoughts on the OSD side stuff:
>> >
>> > - The past history of filters may be large.
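[Editor's aside: the sizing behavior quoted above, roughly 1 byte per insertion at ~1% fpp and the jump from ~20 KB to ~32 KB when splitting 16k insertions across 16 bins, follows from the standard bloom filter formulas. A quick sketch to reproduce those numbers; the helper names here are mine, not the bloom_filter class API:]

```python
import math

def bloom_bits_per_insert(fpp):
    """Optimal bits per inserted key for a target false positive rate."""
    return -math.log(fpp) / (math.log(2) ** 2)

def bloom_num_hashes(fpp):
    """Optimal number of hash functions (salts) for a target fpp."""
    return max(1, round(-math.log(fpp) / math.log(2)))

def binned_size_bytes(total_inserts, target_fpp, bins):
    """Total size in bytes of `bins` filters covering `total_inserts`.

    A query must probe every bin, so each bin gets target_fpp / bins
    for the union of the filters to stay at the target rate.
    """
    per_bin_fpp = target_fpp / bins
    per_bin_inserts = total_inserts / bins
    per_bin_bits = per_bin_inserts * bloom_bits_per_insert(per_bin_fpp)
    return bins * per_bin_bits / 8

print(bloom_bits_per_insert(0.01))         # ~9.6 bits, just over 1 byte
print(bloom_num_hashes(0.01))              # 7 salts for ~1% fpp
print(binned_size_bytes(16384, 0.01, 1))   # ~19.6 KB, the "20 KB" above
print(binned_size_bytes(16384, 0.01, 16))  # ~31.4 KB, the "32 KB" above
```

[The per-bin filters are individually smaller but cost more in total, matching Sage's point that N bins each need 1/Nth the per-bin fpp.]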
>> > I suggest we stick them in an omap object that lives with the PG
>> > and is recovered/backfilled, but that needs to happen in a way that
>> > is out of band wrt other objects. It should get scrubbed.
>>
>> Any guesses as to how large this will get yet? It's too early in the
>> morning for me to do math...
>
> Greg did a quick calc yesterday; I forget. It was enough that it
> might fit in RAM for a spinning disk for 1-2 days of history, but
> only marginally.

Well, it depends on the sort of windows and overlaps we want, but:

  120 ops/s * 3600 s/hour * 1 byte/op = ~422 KB/hour, or <10 MB/day

(where we assume each op touches a different object, which is
unlikely). So it should actually fit in RAM just fine as long as we
make it durable, but we need to handle all-SSD setups as well, and
those can get much larger.

What's more interesting here is that, for those who require it, we can
feasibly store a list of the actual objects touched for only 4 times
as much data (4-byte hashes versus ~1 byte per insert in the bloom
filter).

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
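[Editor's aside: Greg's back-of-envelope above checks out; spelled out below, with the 4x figure being the cost of keeping exact 4-byte object hashes instead of ~1 bloom byte per insert:]

```python
# Back-of-envelope from the thread: a spinning disk doing ~120 ops/s,
# with the bloom filter costing ~1 byte per insertion at ~1% fpp, and
# (pessimistically) every op touching a distinct object.
ops_per_sec = 120
bytes_per_insert = 1

per_hour = ops_per_sec * 3600 * bytes_per_insert  # filter bytes per hour
per_day = per_hour * 24

print(per_hour / 1024)      # ~422 KiB/hour
print(per_day / 1024 ** 2)  # ~9.9 MiB/day

# Keeping exact 4-byte object hashes instead of bloom bits costs 4x:
print(per_day * 4 / 1024 ** 2)  # ~39.6 MiB/day
```

[An all-SSD OSD sustaining orders of magnitude more ops/s scales these numbers proportionally, which is the caveat about SSD setups above.]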