On Wed, Oct 2, 2013 at 5:19 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Wed, 2 Oct 2013, Gregory Farnum wrote:
>> On Wed, Oct 2, 2013 at 5:02 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> > If we make this a special internal object we need to complicate recovery
>> > and namespacing to keep it separate from user data. We also need to
>> > implement a new API for retrieving, trimming, and so forth.
>> >
>> > Instead, we could just store the in-progress and completed bloom filters
>> > (or even an explicit hit list) as regular rados objects in a separate
>> > namespace. The namespace could be '.ceph' or similar by default, but
>> > configurable in case the user wants something different for some reason.
>> >
>> > Normal recovery should work unmodified.
>> >
>> > The normal rados API could be used to fetch (or even delete) old info.
>> >
>> > I think the main challenge is making an object_locator_t that maps cleanly
>> > into a specific PG so that a particular object is always stored exactly
>> > with that PG. This should be a pretty easy change to object_locator_t.
>> > In the mapping process, all we're doing is hashing the key string and
>> > mixing in the pool hash; here we'd just be able to specify the resulting
>> > value explicitly.
>> >
>> > Thoughts?
>> > sage
>>
>> PG splitting?
>
> Old objects stay with the parent. A clever user could look at the parent
> PG's bloom filters if they are looking back in time past the split.
>
> On merge, multiple children would end up in the combined PG. We'd
> need to include the pgid in the object name to avoid having multiple
> objects with the same name and different locators.
>
>> That and other internal mechanisms are already going to need to treat
>> it as a special object. I think recovery will as well; what happens if
>> we're serving up writes during a long-running recovery but haven't
>> gotten to recovering that object yet when we need to persist?
>
> In the simplest case these objects would be created once and never
> modified, in which case nothing would prevent them from being created.
> If they are past the backfill position, backfill targets wouldn't get a
> copy, just like with normal writes.
>
> On (librados) read, we'd recover it immediately, just as we do with normal
> objects.
>
> In the bloom filter case, being limited to immutable objects should be
> fine. Maybe this would work for other future features that need to store
> extra stuff with the PG, but I'm unsure what that would be right now.
> (Maybe index objects if we ever index on select xattrs?)

Oh, so storing each hourly filter (or whatever) as its own object, rather
than as an aggregate. I don't particularly mind hiding a special namespace
from people (we already take over ".snap" in CephFS and nobody cares about
that); I think we should probably do that for non-admin users.

> In any case, this seems simpler than special casing a bunch of recovery
> and backend namespaces for special hidden objects and setting up new APIs
> to access them. OTOH, it doesn't give users absolute reign over their
> namespaces by hiding system objects. That is seeming less important to
> me now than it once did, though.

Yeah, I have no problem with that, and if we're doing immutable write-once
objects I think this should work fine. I just have concerns if we extend it
to special write-many objects, which I suspect will be harder (though
perhaps the normal object dependency recovery algorithms would handle it
naturally).
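
To make the object_locator_t idea concrete, here's a rough sketch of the
kind of change I'm picturing. The names (locator_sketch, has_explicit_hash,
pg_of) are just illustrative, not the real Ceph types, and the stand-in
std::hash plus power-of-two pg_num are simplifications of the actual
rjenkins + stable_mod + CRUSH placement:

#include <cstdint>
#include <string>
#include <functional>

// Illustrative sketch only, not the real Ceph types or field names.
// The point: the normal path hashes the object key (and mixes in the
// pool) to pick a PG, while these internal objects would carry an
// explicitly pinned hash value so they always map to the PG they
// describe.
struct locator_sketch {
  int64_t pool = 0;
  std::string key;
  bool has_explicit_hash = false;  // hypothetical flag
  uint32_t explicit_hash = 0;      // hypothetical pinned value

  uint32_t raw_hash() const {
    if (has_explicit_hash)
      return explicit_hash;  // caller pinned the placement
    // Stand-in for the real key hash (rjenkins in Ceph).
    return static_cast<uint32_t>(std::hash<std::string>{}(key));
  }
};

// Fold the raw hash onto a PG within the pool.  pg_num is assumed to
// be a power of two here for simplicity; real placement uses
// stable_mod and then CRUSH to pick the OSDs.
uint32_t pg_of(const locator_sketch& loc, uint32_t pg_num) {
  return loc.raw_hash() & (pg_num - 1);
}

So writing a per-PG hit set object would just mean setting explicit_hash
to the owning PG's hash value (and, per the merge point above, putting
the pgid in the object name so children don't collide after a merge).
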
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com