Re: Refactor DBObjectMap Proposal

Sage Weil <sage@xxxxxxxxxxx> · Sat, 21 Dec 2013 21:20:38 -0800 (PST)

On Sat, 21 Dec 2013, Haomai Wang wrote:
> On Dec 13, 2013, at 1:01 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> 
> > On Thu, 12 Dec 2013, Haomai Wang wrote:
> >> On Thu, Dec 12, 2013 at 1:26 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> >>> [adding cc ceph-devel]
> > 
> > [attempt 2]
> > 
> >>> 
> >>> On Wed, 11 Dec 2013, Haomai Wang wrote:
> >>>> Hi Sage,
> >>>> 
> >>>> At the last CDS, you pointed out the jobs below:
> >>>> 
> >>>> ============================
> >>>> 2. DBObjectMap: refactor interface
> >>>>    1. expose underlying KeyValueDB transactions to caller, so they
> >>>> can bundle several DBObjectMap ops together and capture an entire
> >>>> ObjectStore::Transaction's worth of work
> >>>>    2. expose the user prefixes in a generic way, instead of
> >>>> hard-coding in the omap, xattr, and various internal namespaces
> >>>> 
> >>>> 3. stripe file data over keys
> >>>>    1. Build a class that will implement a file data interface (read
> >>>> extent, write extent, truncate, zero, etc.) on top of DBObjectMap
> >>>>    2. stripe data over keys of size X (e.g., 1MB, which seems to be
> >>>> the limit people are converging around)
> >>>>    3. store file size information in a metadata key.  maybe this can
> >>>> be DBObjectMap::Header; maybe not
> >>>>    4. contemplate future optimizations that put small objects
> >>>> "inline" in the Header (or equivalent) key
> >>>> ============================
> >>>> 
> >>>> I'm interested in implementing this, and I don't know whether you or
> >>>> others have started on it. Here is my idea.
> >>> 
> >>> Nobody is working on this just yet, although there is a lot of interest in
> >>> this area so your timing is very good!
> >>> 
> >>>> Following your comments, I am thinking about implementing the striping
> >>>> of file data over keys in the KeyValueStore class. Add a field called
> >>>> "userdata" to DBObjectMap::Header, which is interpreted by the caller,
> >>>> such as KeyValueStore. Of course, we need to add CRUD operation
> >>>> interfaces for the "userdata" field. KeyValueStore will then use
> >>>> "userdata" to manage the striping layer, perhaps with a metadata table
> >>>> that maps offset->key_name.
> >>> 
> >>> Yes.  My original thought is to make the DBObjectMap type fields a bit
> >>> more general (instead of the hard-coded #defines), but I don't think it
> >>> matters too much.
> >>> 
> >>> For the metadata table, yes eventually.. but I would keep it simple for
> >>> the first pass and iterate from there.
> >>> 
> >>>> Although DBObjectMap already implements the clone operation on
> >>>> "USER_PREFIX" keys, I really don't like operations like lookup_parent,
> >>>> which cause a dependent lookup chain and degrade performance,
> >>>> just like librbd. I suspect that using the current DBObjectMap
> >>>> methods to manage cloned objects may cause performance
> >>>> problems.  So DBObjectMap needs to expose pure KeyValueDB interfaces
> >>>> that KeyValueStore calls to store striped keys, controlled by
> >>>> the metadata table mentioned above. Others, such as the xattr and omap
> >>>> namespaces, will be left intact. The clone operation will be implemented
> >>>> via DBObjectMap::clone; the actual object data won't be changed, and only
> >>>> the metadata table referenced by "userdata" will be copied. Any write
> >>>> operation will be redirected to a new key. In other words, it may look
> >>>> like what librbd does, but here we implement it as ROW, not COW.
> >>>> 
> >>>> The reasons for this design are:
> >>>> 1. Move more work into KeyValueStore rather than DBObjectMap; DBObjectMap
> >>>> is used by FileStore, which limits how much we can change it
> >>> 
> >>> Yes; we need to be a bit careful here.  I'm hoping the main changes though
> >>> are really just moving the transaction create and submit boilerplate in
> >>> each method into the FileStore callers?
> >> 
> >> In my mind, I don't want to change the caller codes such as FileStore.
> >> It works well now. ;-)
> > 
> > True.  We can also just make a second layer of methods (_foo() instead of 
> > foo() or something) that take the transaction as an argument.
> > 
> > Or just fork DBObjectMap entirely so that we don't need to worry about 
> > breaking FileStore ondisk compatibility; we will likely want/need to do 
> > something like that eventually anyway!
> 
> I'm confused by the "_remove" interface in FileStore, which doesn't seem to
> remove the omap keys of the corresponding object. I tried dumping the
> transaction that "rados rm object -p data" issues, and it contains no delete
> operations on omap keys.
> 
> So I wonder whether it is correct that we don't remove the omap keys. I
> notice that MemStore does an omap erase operation:
>   c->object_map.erase(oid);
>   c->object_hash.erase(oid);

FileStore::_remove() calls lfn_unlink(), which calls 
object_map->clear(...) (if nlink == 0).

I think that's what you're looking for?
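
For illustration only, the call chain above can be sketched in a few lines of
C++. This is a toy model, not the FileStore code: ToyStore, ToyObjectMap, and
all of their members are hypothetical names. The only behavior taken from the
description above is that the remove path goes through an unlink step, and
that step clears the object's omap once the link count reaches zero.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the object map: object id -> omap key/value pairs.
struct ToyObjectMap {
  std::map<std::string, std::map<std::string, std::string>> omap;

  // Drop every omap key that belongs to the object.
  void clear(const std::string& oid) { omap.erase(oid); }
};

// Hypothetical stand-in for the store; tracks a link count per object.
struct ToyStore {
  std::map<std::string, int> nlink;  // object id -> remaining link count
  ToyObjectMap object_map;

  // Same shape as the unlink step: drop one link, and clear the omap
  // only when no links remain.
  void unlink(const std::string& oid) {
    auto it = nlink.find(oid);
    assert(it != nlink.end() && it->second > 0);
    if (--it->second == 0) {
      object_map.clear(oid);
      nlink.erase(it);
    }
  }

  // Same shape as the remove step: it issues no omap operation of its own
  // and relies entirely on unlink() for the cleanup.
  void remove(const std::string& oid) { unlink(oid); }
};
```

In this sketch, an object with two links keeps its omap keys after the first
remove() and loses them after the second, so the omap cleanup never appears as
a separate operation at the remove level.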

sage

> 
> > 
> > sage
> > 
> >>> 
> >>>> 2. Reading/writing an object is a more frequent operation than
> >>>> omap or xattr operations, so we need more specialized handling, now or
> >>>> in the future, to optimize it.
> >>>> 3. Different kv backends may have different features, just like
> >>>> FileSystemBackend; we would like to deal with these in KeyValueStore,
> >>>> not in DBObjectMap or an upper class.
> >>>> 4. DBObjectMap is a little complicated and maybe not suitable for doing more things.
> >>> 
> >>> I'm not fully following this description, but it sounds like you're
> >>> thinking about the right issues.  A few comments:
> >>> 
> >>> - In the ideal case, we'd like to minimize the number of lookups/keys we
> >>> query to access an object.  This is a bit less important for objects that
> >>> are cloned (they tend to be snapshots... mostly).
> >>> 
> >>> - I think it makes sense to make the main header key for an object be able
> >>> to embed various bits of useful data, like
> >>> 
> >>> - all of the xattrs, if there aren't many of them
> >>> - the file size
> >>> - the file content, if it is small
> >>> 
> >>> No need for this in the initial implementation, but we should design
> >>> something that can accommodate it.
> >>> 
> >>> - It would be nice to capture the striping CRUD stuff in a separate class;
> >>> a child of DBObjectMap or something similar.  This will make it easy to
> >>> swap out and/or experiment with different approaches.
> >>> 
> >>>> So in this proposal, DBObjectMap will serve as a bridge in front
> >>>> of KeyValueDB. KeyValueStore mainly uses the DBObjectMap API to store
> >>>> striped objects and DBObjectMap::Header to store metadata. If so, my
> >>>> previous implementation could be fully reused. :-)
> >>> 
> >>> That's great news!  Let me know if there is anything we can do to help
> >>> here.
> >>> 
> >>> sage
> >> 
> >> Thanks for your comments!
> >> 
> >> 
> >> -- 
> >> Best Regards,
> >> 
> >> Wheat
> 
> Best regards,
> Wheats
> 
> 
> 
> 