On Wed, Oct 2, 2013 at 5:19 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Wed, 2 Oct 2013, Gregory Farnum wrote:
>> On Wed, Oct 2, 2013 at 5:02 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> > If we make this a special internal object we need to complicate recovery
>> > and namespacing to keep it separate from user data. We also need to
>> > implement a new API for retrieving, trimming, and so forth.
>> >
>> > Instead, we could just store the in-progress and completed bloom filters
>> > (or even an explicit hit list) as regular rados objects in a separate
>> > namespace. The namespace could be '.ceph' or similar by default, but
>> > configurable in case the user wants something different for some reason.
>> >
>> > Normal recovery should work unmodified.
>> >
>> > The normal rados API could be used to fetch (or even delete) old info.
>> >
>> > I think the main challenge is making an object_locator_t that maps cleanly
>> > into a specific PG so that a particular object is always stored exactly
>> > with that PG. This should be a pretty easy change to object_locator_t.
>> > In the mapping process, all we're doing is hashing the key string and
>> > mixing in the pool hash; here we'd just be able to specify the resulting
>> > value explicitly.
>> >
>> > Thoughts?
>> > sage
>>
>> PG splitting?
>
> Old objects stay with the parent. A clever user could look at the parent
> PG's bloom filters if they are looking back in time past the split.
>
> On merge, multiple children would end up in the combined PG. We'd
> need to include the pgid in the object name to avoid having multiple
> objects with the same name and different locators.
>
>> That and other internal mechanisms are already going to need to treat
>> it as a special object. I think recovery will as well; what happens if
>> we're serving up writes during a long-running recovery but haven't
>> gotten to recovering that object yet when we need to persist?
>
> In the simplest case these objects would be created once and never
> modified, in which case nothing would prevent them from being created.
> If they are past the backfill position, backfill targets wouldn't get a
> copy, just like with normal writes.
>
> On (librados) read, we'd recover it immediately, just as we do with normal
> objects.
>
> In the bloom filter case, being limited to immutable objects should be
> fine. Maybe this would work for other future features that need to store
> extra stuff with the PG, but I'm unsure what that would be right now.
> (Maybe index objects if we ever index on select xattrs?)

Oh, so storing each hourly filter (or whatever) as its own object, rather
than as an aggregate. I don't particularly mind hiding a special namespace
from people (we already take over ".snap" in CephFS and nobody cares about
that); I think we should probably do that for non-admin users.

> In any case, this seems simpler than special casing a bunch of recovery
> and backend namespaces for special hidden objects and setting up new APIs
> to access them. OTOH, it doesn't give users absolute reign over their
> namespaces by hiding system objects. That is seeming less important to
> me now than it once did, though.

Yeah, I have no problem with that, and if we're doing immutable write-once
objects I think this should work fine. I just have concerns if we extend it
to special write-many objects, which I suspect will be harder (though
perhaps the normal object dependency recovery algorithms would handle it
naturally).
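
To make the object_locator_t idea concrete, here's a rough sketch of the
kind of change I'm picturing. The names (locator_sketch, has_explicit_hash,
pg_of) are just illustrative, not the real Ceph types, and the stand-in
std::hash plus power-of-two pg_num are simplifications of the actual
rjenkins + stable_mod + CRUSH placement:

#include <cstdint>
#include <string>
#include <functional>

// Illustrative sketch only, not the real Ceph types or field names.
// The point: the normal path hashes the object key (and mixes in the
// pool) to pick a PG, while these internal objects would carry an
// explicitly pinned hash value so they always map to the PG they
// describe.
struct locator_sketch {
  int64_t pool = 0;
  std::string key;
  bool has_explicit_hash = false;  // hypothetical flag
  uint32_t explicit_hash = 0;      // hypothetical pinned value

  uint32_t raw_hash() const {
    if (has_explicit_hash)
      return explicit_hash;  // caller pinned the placement
    // Stand-in for the real key hash (rjenkins in Ceph).
    return static_cast<uint32_t>(std::hash<std::string>{}(key));
  }
};

// Fold the raw hash onto a PG within the pool.  pg_num is assumed to
// be a power of two here for simplicity; real placement uses
// stable_mod and then CRUSH to pick the OSDs.
uint32_t pg_of(const locator_sketch& loc, uint32_t pg_num) {
  return loc.raw_hash() & (pg_num - 1);
}

So writing a per-PG hit set object would just mean setting explicit_hash
to the owning PG's hash value (and, per the merge point above, putting
the pgid in the object name so children don't collide after a merge).
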
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com