OK, I just looked through part of the code and realized it.  It looks like we
sync metadata on every WAL write, and the do_transaction work happens ahead of
the WAL step.  Could that add more latency than before?  The latency of
do_transaction can't simply be ignored in some latency-sensitive cases, and it
may trigger a lookup operation (get_onode).

On Fri, Feb 20, 2015 at 11:00 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Fri, 20 Feb 2015, Haomai Wang wrote:
>> So cool!
>>
>> A few notes:
>>
>> 1. What about the sync thread in NewStore?
>
> My thought right now is that there will be a WAL thread and (maybe) a
> transaction commit completion thread.  What do you mean by sync thread?
>
> One thing I want to avoid is the current 'op' thread in FileStore.
> Instead of queueing a transaction we will start all of the aio operations
> synchronously.  This has the nice (?) side-effect that if there is memory
> backpressure it will block at submit time so we don't need to do our own
> throttling.  (...though we may want to do it ourselves later anyway.)
>
>> 2. Could we consider skipping WAL for large overwrites (backfill, RGW)?
>
> We do (or will)... if there is a truncate to 0 it doesn't need to do WAL
> at all.  The onode stores the size so we'll ignore any stray bytes after
> that in the file; that lets us do the truncate async after the txn
> commits.  (Slightly sloppy but the space leakage window is so small I
> don't think it's worth worrying about.)
>
>> 3. Sorry, what does [aio_]fsync mean?
>
> aio_fsync is just an fsync that's submitted as an aio operation.  It'll
> make fsync fit into the same bucket as the aio writes we queue up, and it
> also means that if/when the experimental batched fsync stuff goes into XFS
> we'll take advantage of it (lots of fsyncs will be merged into a single
> XFS transaction and be much more efficient).
>
> sage
>
>
>>
>>
>> On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > Hi everyone,
>> >
>> > We talked a bit about the proposed "KeyFile" backend a couple months back.
>> > I've started putting together a basic implementation and wanted to give
>> > people an update about what things are currently looking like.  We're
>> > calling it NewStore for now unless/until someone comes up with a better
>> > name (KeyFileStore is way too confusing).  (*)
>> >
>> > You can peruse the incomplete code at
>> >
>> >   https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore
>> >
>> > This is a bit of a brain dump.  Please ask questions if anything isn't
>> > clear.  Also keep in mind I'm still at the stage where I'm trying to get
>> > it into a semi-working state as quickly as possible so the implementation
>> > is pretty rough.
>> >
>> > Basic design:
>> >
>> > We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.
>> > Object data is stored in files with simple names (%d) in a simple
>> > directory structure (one level deep, default 1M files per dir).  The main
>> > piece of metadata we store is a mapping from object name (ghobject_t) to
>> > onode_t, which looks like this:
>> >
>> >   struct onode_t {
>> >     uint64_t size;                       ///< object size
>> >     map<string, bufferptr> attrs;        ///< attrs
>> >     map<uint64_t, fragment_t> data_map;  ///< data (offset to fragment mapping)
>> >   };
>> >
>> > i.e., it's what we used to rely on xattrs on the inode for.  Here, we'll
>> > only lean on the file system for file data and its block management.
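
To make the get_onode lookup mentioned at the top concrete, here is a minimal
C++ sketch of an onode fetch against a key/value backend: cache first, then
the kvdb.  SimpleKV, OnodeCache, and the key encoding are illustrative
stand-ins, not the actual NewStore or KeyValueDB interfaces.

  // Minimal sketch: encoded object name -> onode fetch, cache first, then kvdb.
  // SimpleKV stands in for KeyValueDB (leveldb/rocksdb); onode_t is trimmed to
  // the fields from the mail above.

  #include <cstdint>
  #include <map>
  #include <memory>
  #include <optional>
  #include <string>

  struct onode_t {
    uint64_t size = 0;                          // object size
    std::map<std::string, std::string> attrs;   // attrs (bufferptr in Ceph)
    // data_map (offset -> fragment) omitted; fragment_t is defined further down
  };

  // Stand-in for the metadata kvdb: key is the encoded object name, value the
  // encoded onode (a bufferlist in the real code).
  struct SimpleKV {
    std::map<std::string, onode_t> table;
    std::optional<onode_t> get(const std::string& key) const {
      auto it = table.find(key);
      if (it == table.end())
        return std::nullopt;
      return it->second;
    }
  };

  using OnodeCache = std::map<std::string, std::shared_ptr<onode_t>>;

  // get_onode: consult the per-collection cache, then fall back to the kvdb.
  // The kvdb fallback is the extra lookup latency the note at the top worries
  // about, since it sits on the synchronous do_transaction path.
  std::shared_ptr<onode_t> get_onode(OnodeCache& cache, const SimpleKV& db,
                                     const std::string& key) {
    auto it = cache.find(key);
    if (it != cache.end())
      return it->second;                 // cache hit: no kvdb round trip
    auto v = db.get(key);
    if (!v)
      return nullptr;                    // object does not exist
    auto o = std::make_shared<onode_t>(*v);
    cache[key] = o;                      // keep it warm for the next op
    return o;
  }
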
>> >
>> > fragment_t looks like
>> >
>> >   struct fragment_t {
>> >     uint32_t offset;   ///< offset in file to first byte of this fragment
>> >     uint32_t length;   ///< length of fragment/extent
>> >     fid_t fid;         ///< file backing this fragment
>> >   };
>> >
>> > and fid_t is
>> >
>> >   struct fid_t {
>> >     uint32_t fset, fno;   // identify the file name: fragments/%d/%d
>> >   };
>> >
>> > To start we'll keep the mapping pretty simple (just one fragment_t) but
>> > later we can go for varying degrees of complexity.
>> >
>> > We lean on the kvdb for our transactions.
>> >
>> > If we are creating new objects, we write data into a new file/fid,
>> > [aio_]fsync, and then commit the transaction.
>> >
>> > If we are doing an overwrite, we include a write-ahead log (wal) item in
>> > our transaction, and then apply it afterwards.  For example, a 4k
>> > overwrite would make whatever metadata changes are included, and a wal
>> > item that says "then overwrite this 4k in this fid with this data".  i.e.,
>> > the worst case is more or less what FileStore is doing now with its
>> > journal, except here we're using the kvdb (and its journal) for that.  On
>> > restart we can queue up and apply any unapplied wal items.
>> >
>> > An alternative approach here that we discussed a bit yesterday would be to
>> > write the small overwrites into the kvdb adjacent to the onode.  Actually
>> > writing them back to the file could be deferred until later, maybe when
>> > there are many small writes to be done together.
>> >
>> > But right now the write behavior is very simple, and handles just 3 cases:
>> >
>> >   https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339
>> >
>> > 1. New object: create a new file and write there.
>> >
>> > 2. Append: append to an existing fid.  We store the size in the onode so
>> >    we can be a bit sloppy and in the failure case (where we write some
>> >    extra data to the file but don't commit the onode) just ignore any
>> >    trailing file data.
>> >
>> > 3. Anything else: generate a WAL item.
>> >
>> > 4. Maybe later, for some small [over]writes, we instead put the new data
>> >    next to the onode.
>> >
>> > There is no omap yet.  I think we should do basically what DBObjectMap did
>> > (with a layer of indirection to allow clone etc), but we need to rejigger
>> > it so that the initial pointer into that structure is embedded in the
>> > onode.  We may want to do some other optimization to avoid extra
>> > indirection in the common case.  Leaving this for later, though...
>> >
>> > We are designing for the case where the workload is already sharded across
>> > collections.  Each collection gets an in-memory Collection, which has its
>> > own RWLock and its own onode_map (SharedLRU cache).  A split will
>> > basically amount to registering the new collection in the kvdb and
>> > clearing the in-memory onode cache.
>> >
>> > There is a TransContext structure that is used to track the progress of a
>> > transaction.  It'll list which fd's need to get synced pre-commit, which
>> > onodes need to get written back in the transaction, and any WAL items to
>> > include and queue up after the transaction commits.  Right now the
>> > queue_transaction path does most of the work synchronously just to get
>> > things working.  Looking ahead I think what it needs to do is:
>> >
>> >  - assemble the transaction
>> >  - start any aio writes (we could use O_DIRECT here if the new hints
>> >    include WONTNEED?)
>> >  - start any aio fsync's
>> >  - queue kvdb transaction
>> >  - fire onreadable[_sync] notifications (I suspect we'll want to do this
>> >    unconditionally; maybe we avoid using them entirely?)
>> >
>> > On transaction commit,
>> >  - fire commit notifications
>> >  - queue WAL operations to a finisher
>> >
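
As a way of visualizing the pipeline just described, here is a rough C++
sketch of a TransContext and a submit path that follows the same ordering.
The field names and the stubbed helpers are illustrative only, not the code
in NewStore.cc, which drives real aio submission and KeyValueDB transactions.

  // Rough sketch of the TransContext / submit ordering described above.
  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  struct wal_item_t {              // e.g. "overwrite this 4k in this fid"
    uint32_t fset = 0, fno = 0;    // which fragment file (fid_t)
    uint64_t offset = 0;
    std::string data;
  };

  struct pending_write_t {         // an aio write staged by the transaction
    int fd = -1;
    uint64_t offset = 0;
    std::string data;
  };

  struct TransContext {
    std::vector<pending_write_t> writes;     // aio writes to start at submit
    std::vector<int> fds_to_sync;            // fds needing [aio_]fsync pre-commit
    std::vector<std::string> dirty_onodes;   // onode keys written in the kv txn
    std::vector<wal_item_t> wal_items;       // applied only after the txn commits
    std::function<void()> on_readable;       // notifications
    std::function<void()> on_commit;
  };

  // Stubs standing in for aio submission, the kvdb, and the WAL finisher.
  void start_aio_write(const pending_write_t&) {}
  void start_aio_fsync(int) {}
  void queue_kv_transaction(TransContext&, std::function<void()> on_commit) {
    on_commit();   // sketch artifact: pretend the kv txn commits right away;
  }                // in reality this fires later from a completion thread
  void queue_wal(const std::vector<wal_item_t>&) {}

  // Mirrors the bullet list: everything is started synchronously at submit
  // time (no separate 'op' thread), then the kv transaction is queued.
  void submit(TransContext& txc) {
    for (const auto& w : txc.writes)      // - start any aio writes
      start_aio_write(w);
    for (int fd : txc.fds_to_sync)        // - start any aio fsyncs
      start_aio_fsync(fd);
    queue_kv_transaction(txc, [&txc] {    // - queue kvdb transaction
      if (txc.on_commit) txc.on_commit(); //   on commit: fire commit notifications
      queue_wal(txc.wal_items);           //   and queue WAL ops to a finisher
    });
    if (txc.on_readable) txc.on_readable();  // - fire onreadable notifications
  }
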
>> > The WAL ops will be linked to the TransContext so that if you want to do a
>> > read on the onode you can block until it completes.  If we keep the
>> > (currently simple) locking then we can use the Collection rwlock to block
>> > new writes while we wait for previous ones to apply.  Or we can get more
>> > granular with the read vs write locks, but I'm not sure it'll be any use
>> > until we make major changes in the OSD (like dispatching parallel reads
>> > within a PG).
>> >
>> > Clone is annoying; if the FS doesn't support it natively (anything not
>> > btrfs) I think we should just do a sync read and then write for
>> > simplicity.
>> >
>> > A few other thoughts:
>> >
>> > - For a fast kvdb, we may want to do the transaction commit synchronously.
>> >   For disk backends I think we'll want it async, though, to avoid blocking
>> >   the caller.
>> >
>> > - The fid_t has an inode number stashed in it.  The idea is to use
>> >   open_by_handle to avoid traversing the (shallow) directory and go straight
>> >   to the inode.  On XFS this means we traverse the inode btree to verify it
>> >   is in fact a valid ino, which isn't totally ideal but probably what we
>> >   have to live with.  Note that open_by_handle will work on any other
>> >   (NFS-exportable) filesystem as well so this is in no way XFS-specific.
>> >   This isn't implemented yet, but when we do, we'll probably want to verify
>> >   we got the right file by putting some id in an xattr; that way you could
>> >   safely copy the whole thing to another filesystem and it could gracefully
>> >   fall back to opening using the file names.
>> >
>> > - I think we could build a variation on this implementation on top of an
>> >   NVMe device instead of a file system.  It could pretty trivially lay out
>> >   writes in the address space as a linear sweep across the virtual address
>> >   space.  If the NVMe address space is big enough, maybe we could even avoid
>> >   thinking about reusing addresses for deleted objects?  We'd just send a
>> >   discard and then forget about it.  Not sure if the address space is really
>> >   that big, though...  If not, we'd need to make a simple allocator (blah).
>> >
>> > sage
>> >
>> >
>> > * This follows in the Messenger's naming footsteps, which went like this:
>> >   MPIMessenger, NewMessenger, NewerMessenger, SimpleMessenger (which ended
>> >   up being anything but simple).
>>
>> --
>> Best Regards,
>>
>> Wheat

--
Best Regards,

Wheat
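
The open_by_handle idea in the thread above could look roughly like the sketch
below: try open_by_handle_at(2) with a handle captured earlier via
name_to_handle_at(2), and fall back to the fragments/%d/%d pathname plus an id
xattr check if the handle turns out to be stale.  The xattr name, the helper
name, and the error handling are illustrative assumptions, not NewStore code;
note that open_by_handle_at also requires CAP_DAC_READ_SEARCH.

  // Illustrative sketch of the open-by-handle fast path with a pathname
  // fallback; not the actual NewStore implementation.
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE            // for name_to_handle_at/open_by_handle_at
  #endif
  #include <fcntl.h>
  #include <sys/xattr.h>
  #include <unistd.h>
  #include <cerrno>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  struct fid_t {
    uint32_t fset = 0, fno = 0;   // file name: fragments/%d/%d
  };

  // fs_root_fd: an fd on the store's filesystem (also serves as mount_fd).
  // handle: previously captured with name_to_handle_at() when the file was
  // created.  Returns an open fd, or -errno on failure.
  int open_fid(int fs_root_fd, const fid_t& fid, struct file_handle* handle) {
    // Fast path: go straight to the inode, no directory traversal.
    int fd = open_by_handle_at(fs_root_fd, handle, O_RDWR);
    if (fd >= 0)
      return fd;

    // Slow path (e.g. the store was copied to another filesystem and the
    // handle is stale): open by name under the shallow directory structure.
    char path[64];
    snprintf(path, sizeof(path), "fragments/%u/%u", fid.fset, fid.fno);
    fd = openat(fs_root_fd, path, O_RDWR);
    if (fd < 0)
      return -errno;

    // Verify we really got the right file via an id xattr (the name here is a
    // made-up example); bail out if it doesn't match the expected fid.
    char want[32], got[32] = {0};
    snprintf(want, sizeof(want), "%u.%u", fid.fset, fid.fno);
    if (fgetxattr(fd, "user.newstore.fid", got, sizeof(got) - 1) <= 0 ||
        strcmp(got, want) != 0) {
      close(fd);
      return -ESTALE;
    }
    return fd;
  }
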