On Fri, 20 Feb 2015, Haomai Wang wrote:
> So cool!
>
> A few notes:
>
> 1. What about the sync thread in NewStore?

My thought right now is that there will be a WAL thread and (maybe) a
transaction commit completion thread. What do you mean by sync thread?

One thing I want to avoid is the current 'op' thread in FileStore. Instead
of queueing a transaction we will start all of the aio operations
synchronously. This has the nice (?) side-effect that if there is memory
backpressure it will block at submit time so we don't need to do our own
throttling. (...though we may want to do it ourselves later anyway.)

> 2. Could we consider skipping WAL for large overwrites (backfill, RGW)?

We do (or will)... if there is a truncate to 0 it doesn't need to do WAL at
all. The onode stores the size so we'll ignore any stray bytes after that
in the file; that lets us do the truncate async after the txn commits.
(Slightly sloppy but the space leakage window is so small I don't think it's
worth worrying about.)

> 3. Sorry, what does [aio_]fsync mean?

aio_fsync is just an fsync that's submitted as an aio operation. It'll make
fsync fit into the same bucket as the aio writes we queue up, and it also
means that if/when the experimental batched fsync stuff goes into XFS we'll
take advantage of it (lots of fsyncs will be merged into a single XFS
transaction and be much more efficient).

sage

>
>
> On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > Hi everyone,
> >
> > We talked a bit about the proposed "KeyFile" backend a couple months back.
> > I've started putting together a basic implementation and wanted to give
> > people an update about what things are currently looking like. We're
> > calling it NewStore for now unless/until someone comes up with a better
> > name (KeyFileStore is way too confusing). (*)
> >
> > You can peruse the incomplete code at
> >
> >   https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore
> >
> > This is a bit of a brain dump.
> > Please ask questions if anything isn't clear. Also keep in mind I'm
> > still at the stage where I'm trying to get it into a semi-working state
> > as quickly as possible so the implementation is pretty rough.
> >
> > Basic design:
> >
> > We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.
> > Object data is stored in files with simple names (%d) in a simple
> > directory structure (one level deep, default 1M files per dir). The main
> > piece of metadata we store is a mapping from object name (ghobject_t) to
> > onode_t, which looks like this:
> >
> >  struct onode_t {
> >    uint64_t size;                       ///< object size
> >    map<string, bufferptr> attrs;        ///< attrs
> >    map<uint64_t, fragment_t> data_map;  ///< data (offset to fragment mapping)
> >  };
> >
> > i.e., it's what we used to rely on xattrs on the inode for. Here, we'll
> > only lean on the file system for file data and its block management.
> >
> > fragment_t looks like
> >
> >  struct fragment_t {
> >    uint32_t offset;   ///< offset in file to first byte of this fragment
> >    uint32_t length;   ///< length of fragment/extent
> >    fid_t fid;         ///< file backing this fragment
> >  };
> >
> > and fid_t is
> >
> >  struct fid_t {
> >    uint32_t fset, fno;   // identify the file name: fragments/%d/%d
> >  };
> >
> > To start we'll keep the mapping pretty simple (just one fragment_t) but
> > later we can go for varying degrees of complexity.
> >
> > We lean on the kvdb for our transactions.
> >
> > If we are creating new objects, we write data into a new file/fid,
> > [aio_]fsync, and then commit the transaction.
> >
> > If we are doing an overwrite, we include a write-ahead log (wal)
> > item in our transaction, and then apply it afterwards. For example, a 4k
> > overwrite would make whatever metadata changes are included, and a wal
> > item that says "then overwrite this 4k in this fid with this data".
> > i.e., the worst case is more or less what FileStore is doing now with its
> > journal, except here we're using the kvdb (and its journal) for that. On
> > restart we can queue up and apply any unapplied wal items.
> >
> > An alternative approach here that we discussed a bit yesterday would be to
> > write the small overwrites into the kvdb adjacent to the onode. Actually
> > writing them back to the file could be deferred until later, maybe when
> > there are many small writes to be done together.
> >
> > But right now the write behavior is very simple, and handles just 3 cases:
> >
> >   https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339
> >
> > 1. New object: create a new file and write there.
> >
> > 2. Append: append to an existing fid. We store the size in the onode so
> >    we can be a bit sloppy and in the failure case (where we write some
> >    extra data to the file but don't commit the onode) just ignore any
> >    trailing file data.
> >
> > 3. Anything else: generate a WAL item.
> >
> > 4. Maybe later, for some small [over]writes, we instead put the new data
> >    next to the onode.
> >
> > There is no omap yet. I think we should do basically what DBObjectMap did
> > (with a layer of indirection to allow clone etc), but we need to rejigger
> > it so that the initial pointer into that structure is embedded in the
> > onode. We may want to do some other optimization to avoid extra
> > indirection in the common case. Leaving this for later, though...
> >
> > We are designing for the case where the workload is already sharded across
> > collections. Each collection gets an in-memory Collection, which has its
> > own RWLock and its own onode_map (SharedLRU cache). A split will
> > basically amount to registering the new collection in the kvdb and
> > clearing the in-memory onode cache.
> >
> > There is a TransContext structure that is used to track the progress of a
> > transaction.
> > It'll list which fds need to get synced pre-commit, which onodes need to
> > get written back in the transaction, and any WAL items to include and
> > queue up after the transaction commits. Right now the queue_transaction
> > path does most of the work synchronously just to get things working.
> > Looking ahead I think what it needs to do is:
> >
> > - assemble the transaction
> > - start any aio writes (we could use O_DIRECT here if the new hints
> >   include WONTNEED?)
> > - start any aio fsyncs
> > - queue kvdb transaction
> > - fire onreadable[_sync] notifications (I suspect we'll want to do this
> >   unconditionally; maybe we avoid using them entirely?)
> >
> > On transaction commit,
> > - fire commit notifications
> > - queue WAL operations to a finisher
> >
> > The WAL ops will be linked to the TransContext so that if you want to do a
> > read on the onode you can block until it completes. If we keep the
> > (currently simple) locking then we can use the Collection rwlock to block
> > new writes while we wait for previous ones to apply. Or we can get more
> > granular with the read vs write locks, but I'm not sure it'll be any use
> > until we make major changes in the OSD (like dispatching parallel reads
> > within a PG).
> >
> > Clone is annoying; if the FS doesn't support it natively (anything not
> > btrfs) I think we should just do a sync read and then write for
> > simplicity.
> >
> > A few other thoughts:
> >
> > - For a fast kvdb, we may want to do the transaction commit synchronously.
> >   For disk backends I think we'll want it async, though, to avoid blocking
> >   the caller.
> >
> > - The fid_t has an inode number stashed in it. The idea is to use
> >   open_by_handle to avoid traversing the (shallow) directory and go
> >   straight to the inode. On XFS this means we traverse the inode btree to
> >   verify it is in fact a valid ino, which isn't totally ideal but probably
> >   what we have to live with.
> >   Note that open_by_handle will work on any other (NFS-exportable)
> >   filesystem as well so this is in no way XFS-specific. This isn't
> >   implemented yet, but when we do, we'll probably want to verify we got
> >   the right file by putting some id in an xattr; that way you could
> >   safely copy the whole thing to another filesystem and it could
> >   gracefully fall back to opening using the file names.
> >
> > - I think we could build a variation on this implementation on top of an
> >   NVMe device instead of a file system. It could pretty trivially lay out
> >   writes in the address space as a linear sweep across the virtual address
> >   space. If the NVMe address space is big enough, maybe we could even
> >   avoid thinking about reusing addresses for deleted objects? We'd just
> >   send a discard and then forget about it. Not sure if the address space
> >   is really that big, though... If not, we'd need to make a simple
> >   allocator (blah).
> >
> > sage
> >
> >
> > * This follows in the Messenger's naming footsteps, which went like this:
> > MPIMessenger, NewMessenger, NewerMessenger, SimpleMessenger (which ended
> > up being anything but simple).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Best Regards,
>
> Wheat
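
For anyone following the thread, here is a rough C++ sketch of the three
write cases listed in the quoted message (new object, append, anything
else), using trimmed-down copies of the quoted structs. It is only an
illustration under assumptions: write_kind, fid_path and classify_write are
hypothetical names that do not come from the NewStore tree, the attrs map is
left out, "no fragments yet" is assumed to mean a new object, and "offset
equals the onode size" is assumed to mean an append.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Trimmed-down copies of the structs quoted in the message (attrs omitted).
struct fid_t {
  uint32_t fset = 0, fno = 0;  // identify the file name: fragments/<fset>/<fno>
};

struct fragment_t {
  uint32_t offset = 0;  // offset in file to first byte of this fragment
  uint32_t length = 0;  // length of fragment/extent
  fid_t fid;            // file backing this fragment
};

struct onode_t {
  uint64_t size = 0;                        // object size
  std::map<uint64_t, fragment_t> data_map;  // object offset -> fragment
};

// Hypothetical names for the three write cases in the message.
enum class write_kind { NEW_FILE, APPEND, WAL };

// Hypothetical helper: relative path of the file backing a fid.
std::string fid_path(const fid_t& fid) {
  return "fragments/" + std::to_string(fid.fset) + "/" +
         std::to_string(fid.fno);
}

// Hypothetical helper: decide which of the three cases a write falls into.
// Assumption: an onode with no fragments has no backing file yet.
write_kind classify_write(const onode_t& o, uint64_t offset) {
  if (o.data_map.empty())
    return write_kind::NEW_FILE;  // case 1: create a new file and write there
  if (offset == o.size)
    return write_kind::APPEND;    // case 2: append to the existing fid
  return write_kind::WAL;         // case 3: anything else gets a WAL item
}
```

With a 4096-byte object backed by fid {1, 2}, a write at offset 4096 would
classify as an append and a write at offset 0 as a WAL overwrite. A write
past the end also lands in the WAL case here, but only because the sketch
does not model holes.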