So cool! A few notes:

1. What about a sync thread in NewStore?
2. Could we consider skipping the WAL for large overwrites (backfill, RGW)?
3. Sorry, what does [aio_]fsync mean?

On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Hi everyone,
>
> We talked a bit about the proposed "KeyFile" backend a couple months back. I've started putting together a basic implementation and wanted to give people an update on what things are currently looking like. We're calling it NewStore for now unless/until someone comes up with a better name (KeyFileStore is way too confusing). (*)
>
> You can peruse the incomplete code at
>
> https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore
>
> This is a bit of a brain dump. Please ask questions if anything isn't clear. Also keep in mind I'm still at the stage where I'm trying to get it into a semi-working state as quickly as possible, so the implementation is pretty rough.
>
> Basic design:
>
> We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata. Object data is stored in files with simple names (%d) in a simple directory structure (one level deep, default 1M files per dir). The main piece of metadata we store is a mapping from object name (ghobject_t) to onode_t, which looks like this:
>
> struct onode_t {
>   uint64_t size;                       ///< object size
>   map<string, bufferptr> attrs;        ///< attrs
>   map<uint64_t, fragment_t> data_map;  ///< data (offset to fragment mapping)
> };
>
> i.e., it's what we used to rely on xattrs on the inode for. Here, we'll only lean on the file system for file data and its block management.
>
> fragment_t looks like
>
> struct fragment_t {
>   uint32_t offset;  ///< offset in file to first byte of this fragment
>   uint32_t length;  ///< length of fragment/extent
>   fid_t fid;        ///< file backing this fragment
> };
>
> and fid_t is
>
> struct fid_t {
>   uint32_t fset, fno;  // identify the file name: fragments/%d/%d
> };
>
> To start we'll keep the mapping pretty simple (just one fragment_t), but later we can go for varying degrees of complexity.
>
> We lean on the kvdb for our transactions.
>
> If we are creating new objects, we write data into a new file/fid, [aio_]fsync, and then commit the transaction.
>
> If we are doing an overwrite, we include a write-ahead log (wal) item in our transaction, and then apply it afterwards. For example, a 4k overwrite would make whatever metadata changes are needed, and include a wal item that says "then overwrite this 4k in this fid with this data". i.e., the worst case is more or less what FileStore is doing now with its journal, except here we're using the kvdb (and its journal) for that. On restart we can queue up and apply any unapplied wal items.
>
> An alternative approach here that we discussed a bit yesterday would be to write the small overwrites into the kvdb adjacent to the onode. Actually writing them back to the file could be deferred until later, maybe when there are many small writes to be done together.
>
> But right now the write behavior is very simple, and handles just 3 cases:
>
> https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339
>
> 1. New object: create a new file and write there.
>
> 2. Append: append to an existing fid. We store the size in the onode so we can be a bit sloppy and in the failure case (where we write some extra data to the file but don't commit the onode) just ignore any trailing file data.
>
> 3. Anything else: generate a WAL item.
>
> 4. Maybe later, for some small [over]writes, we instead put the new data next to the onode.
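
Just to check my understanding of the three cases above, here is a rough sketch of how that decision might look. The names (choose_write_mode, WriteMode) and the simplified structs are mine for illustration, not the actual NewStore code:

  #include <cstdint>
  #include <map>

  // Simplified stand-ins for the onode_t/fragment_t quoted above.
  struct fragment_t { uint32_t offset = 0; uint32_t length = 0; };
  struct onode_t {
    uint64_t size = 0;                        // object size
    std::map<uint64_t, fragment_t> data_map;  // offset -> fragment
  };

  enum class WriteMode { NewFile, Append, WriteAheadLog };

  // Case 1: no data yet -> write into a fresh fid.
  // Case 2: write starts at the logical EOF of a single backing file ->
  //         sloppy append; the size stored in the onode lets us ignore any
  //         trailing file data if we crash before the onode commits.
  // Case 3: anything else -> emit a WAL item and apply it after the commit.
  WriteMode choose_write_mode(const onode_t& o, uint64_t offset)
  {
    if (o.data_map.empty())
      return WriteMode::NewFile;
    if (offset == o.size && o.data_map.size() == 1)
      return WriteMode::Append;
    return WriteMode::WriteAheadLog;
  }
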
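And for case 3 (and the replay of unapplied wal items at startup), I picture the apply step as something like the sketch below. wal_item_t and the kvdb-delete step are placeholders of mine, not the real NewStore/KeyValueDB interface:

  #include <unistd.h>   // pwrite, fsync
  #include <cstdint>
  #include <vector>

  struct wal_item_t {
    uint64_t seq = 0;        // identifies the wal record's key in the kvdb
    uint64_t offset = 0;     // byte offset within the fragment file
    std::vector<char> data;  // the overwrite payload
  };

  // Apply one deferred overwrite to an already-open fragment file.  Safe to
  // redo after a crash: rewriting the same bytes at the same offset is
  // idempotent, so the wal record is only dropped once the data is synced.
  int apply_wal_item(int fd, const wal_item_t& item)
  {
    ssize_t r = pwrite(fd, item.data.data(), item.data.size(), item.offset);
    if (r != (ssize_t)item.data.size())
      return -1;
    if (fsync(fd) < 0)
      return -1;
    // ...then delete the "wal.<seq>" record from the same kvdb that holds
    // the onodes (placeholder; the real call depends on the KeyValueDB API).
    return 0;
  }
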
> There is no omap yet. I think we should do basically what DBObjectMap did (with a layer of indirection to allow clone etc), but we need to rejigger it so that the initial pointer into that structure is embedded in the onode. We may want to do some other optimization to avoid extra indirection in the common case. Leaving this for later, though...
>
> We are designing for the case where the workload is already sharded across collections. Each collection gets an in-memory Collection, which has its own RWLock and its own onode_map (SharedLRU cache). A split will basically amount to registering the new collection in the kvdb and clearing the in-memory onode cache.
>
> There is a TransContext structure that is used to track the progress of a transaction. It'll list which fd's need to get synced pre-commit, which onodes need to get written back in the transaction, and any WAL items to include and queue up after the transaction commits. Right now the queue_transaction path does most of the work synchronously just to get things working. Looking ahead I think what it needs to do is:
>
> - assemble the transaction
> - start any aio writes (we could use O_DIRECT here if the new hints include WONTNEED?)
> - start any aio fsync's
> - queue kvdb transaction
> - fire onreadable[_sync] notifications (I suspect we'll want to do this unconditionally; maybe we avoid using them entirely?)
>
> On transaction commit,
> - fire commit notifications
> - queue WAL operations to a finisher
>
> The WAL ops will be linked to the TransContext so that if you want to do a read on the onode you can block until it completes. If we keep the (currently simple) locking then we can use the Collection rwlock to block new writes while we wait for previous ones to apply. Or we can get more granular with the read vs write locks, but I'm not sure it'll be any use until we make major changes in the OSD (like dispatching parallel reads within a PG).
>
> Clone is annoying; if the FS doesn't support it natively (anything not btrfs) I think we should just do a sync read and then write, for simplicity.
>
> A few other thoughts:
>
> - For a fast kvdb, we may want to do the transaction commit synchronously. For disk backends I think we'll want it async, though, to avoid blocking the caller.
>
> - The fid_t has an inode number stashed in it. The idea is to use open_by_handle to avoid traversing the (shallow) directory and go straight to the inode. On XFS this means we traverse the inode btree to verify it is in fact a valid ino, which isn't totally ideal but probably what we have to live with. Note that open_by_handle will work on any other (NFS-exportable) filesystem as well, so this is in no way XFS-specific. This isn't implemented yet, but when we do, we'll probably want to verify we got the right file by putting some id in an xattr; that way you could safely copy the whole thing to another filesystem and it could gracefully fall back to opening using the file names.
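
On the open_by_handle idea: for the verify-and-fall-back part I imagine something like the sketch below. The xattr name ("user.fid"), the handle field in fid_t, and open_fragment() are all made up for illustration; the real code would presumably store whatever name_to_handle_at() returns alongside the ino:

  #include <fcntl.h>      // open, open_by_handle_at (Linux-specific)
  #include <sys/xattr.h>  // fgetxattr
  #include <unistd.h>     // close
  #include <climits>      // PATH_MAX
  #include <cstdint>
  #include <cstdio>       // snprintf

  struct fid_t {
    uint32_t fset = 0, fno = 0;      // file name: fragments/%d/%d
    unsigned char handle[128] = {};  // opaque struct file_handle, filled by
                                     // name_to_handle_at() at file creation
    bool have_handle = false;
  };

  // Try the stashed handle first (skips the directory lookup), verify the
  // file really is ours via a small xattr, and fall back to the path if
  // anything doesn't match (e.g. the store was copied to another filesystem).
  // Note: open_by_handle_at() needs CAP_DAC_READ_SEARCH.
  int open_fragment(int mount_fd, const char* base, const fid_t& fid)
  {
    int fd = -1;
    if (fid.have_handle)
      fd = open_by_handle_at(mount_fd, (struct file_handle*)fid.handle, O_RDWR);
    if (fd >= 0) {
      uint32_t stored[2] = {0, 0};
      ssize_t r = fgetxattr(fd, "user.fid", stored, sizeof(stored));
      if (r != (ssize_t)sizeof(stored) ||
          stored[0] != fid.fset || stored[1] != fid.fno) {
        close(fd);                   // wrong or unverifiable inode
        fd = -1;
      }
    }
    if (fd < 0) {
      char path[PATH_MAX];
      snprintf(path, sizeof(path), "%s/fragments/%u/%u", base, fid.fset, fid.fno);
      fd = open(path, O_RDWR);
    }
    return fd;                       // -1 with errno set on failure
  }
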
> - I think we could build a variation on this implementation on top of an NVMe device instead of a file system. It could pretty trivially lay out writes in the address space as a linear sweep across the virtual address space. If the NVMe address space is big enough, maybe we could even avoid thinking about reusing addresses for deleted objects? We'd just send a discard and then forget about it. Not sure if the address space is really that big, though... If not, we'd need to make a simple allocator (blah).
>
> sage
>
>
> * This follows in the Messenger's naming footsteps, which went like this: MPIMessenger, NewMessenger, NewerMessenger, SimpleMessenger (which ended up being anything but simple).

--
Best Regards,
Wheat