Re: NewStore update

OK, I just read through part of the code and see it now.

It looks like we sync metadata on every WAL operation, and we do the
do_transaction work ahead of the WAL step.  Might that cause higher latency
than before?  The do_transaction latency can't simply be ignored in
latency-sensitive cases, and it may trigger a lookup operation (get_onode).

On Fri, Feb 20, 2015 at 11:00 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Fri, 20 Feb 2015, Haomai Wang wrote:
>> So cool!
>>
>> A few notes:
>>
>> 1. What about a sync thread in NewStore?
>
> My thought right now is that there will be a WAL thread and (maybe) a
> transaction commit completion thread.  What do you mean by sync thread?
>
> One thing I want to avoid is the current 'op' thread in FileStore.
> Instead of queueing a transaction we will start all of the aio operations
> synchronously.  This has the nice (?) side-effect that if there is memory
> backpressure it will block at submit time so we don't need to do our own
> throttling.  (...though we may want to do it ourselves later anyway.)
>
>> 2. Could we consider skipping WAL for large overwrites (backfill, RGW)?
>
> We do (or will)... if there is a truncate to 0 it doesn't need to do WAL
> at all.  The onode stores the size so we'll ignore any stray bytes after
> that in the file; that lets us do the truncate async after the txn
> commits.  (Slightly sloppy but the space leakage window is so small I
> don't think it's worth worrying about.)
>
>> 3. Sorry, what does [aio_]fsync mean?
>
> aio_fsync is just an fsync that's submitted as an aio operation.  It'll
> make fsync fit into the same bucket as the aio writes we queue up, and it
> also means that if/when the experimental batched fsync stuff goes into XFS
> we'll take advantage of it (lots of fsyncs will be merged into a single
> XFS transaction and be much more efficient).
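>
> For anyone who hasn't used it, here's a minimal sketch (not the NewStore
> code) of queueing an fsync through the POSIX AIO interface and polling it
> later, the same way the queued aio writes would be polled:
>
>   #include <aio.h>
>   #include <fcntl.h>
>   #include <cerrno>
>   #include <cstring>
>
>   // Queue an asynchronous fsync for fd; returns 0 if queued, -1 on error.
>   int queue_async_fsync(int fd, struct aiocb *cb) {
>     std::memset(cb, 0, sizeof(*cb));
>     cb->aio_fildes = fd;
>     // O_DSYNC flushes the data plus the metadata needed to read it back;
>     // O_SYNC would be the full fsync equivalent.
>     return aio_fsync(O_DSYNC, cb);
>   }
>
>   // Poll for completion alongside any queued aio writes.
>   bool async_fsync_done(const struct aiocb *cb) {
>     return aio_error(cb) != EINPROGRESS;   // EINPROGRESS until it finishes
>   }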
>
> sage
>
>
>>
>>
>> On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> > Hi everyone,
>> >
>> > We talked a bit about the proposed "KeyFile" backend a couple months back.
>> > I've started putting together a basic implementation and wanted to give
>> > people an update about what things are currently looking like.  We're
>> > calling it NewStore for now unless/until someone comes up with a better
>> > name (KeyFileStore is way too confusing). (*)
>> >
>> > You can peruse the incomplete code at
>> >
>> >         https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore
>> >
>> > This is a bit of a brain dump.  Please ask questions if anything isn't
>> > clear.  Also keep in mind I'm still at the stage where I'm trying to get
>> > it into a semi-working state as quickly as possible so the implementation
>> > is pretty rough.
>> >
>> > Basic design:
>> >
>> > We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.
>> > Object data is stored in files with simple names (%d) in a simple
>> > directory structure (one level deep, default 1M files per dir).  The main
>> > piece of metadata we store is a mapping from object name (ghobject_t) to
>> > onode_t, which looks like this:
>> >
>> >  struct onode_t {
>> >    uint64_t size;                       ///< object size
>> >    map<string, bufferptr> attrs;        ///< attrs
>> >    map<uint64_t, fragment_t> data_map;  ///< data (offset to fragment mapping)
>> >  };
>> >
>> > i.e., it's what we used to rely on xattrs on the inode for.  Here, we'll
>> > only lean on the file system for file data and its block management.
>> >
>> > fragment_t looks like
>> >
>> >  struct fragment_t {
>> >    uint32_t offset;   ///< offset in file to first byte of this fragment
>> >    uint32_t length;   ///< length of fragment/extent
>> >    fid_t fid;         ///< file backing this fragment
>> >  };
>> >
>> > and fid_t is
>> >
>> >  struct fid_t {
>> >    uint32_t fset, fno;   // identify the file name: fragments/%d/%d
>> >  };
>> >
>> > To start we'll keep the mapping pretty simple (just one fragment_t) but
>> > later we can go for varying degrees of complexity.
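>> >
>> > To make the mapping concrete, here's an illustrative sketch (simplified
>> > stand-ins for the structs above, not the real types) of resolving a
>> > logical offset to the fragment that covers it:
>> >
>> >   #include <cstdint>
>> >   #include <map>
>> >
>> >   struct fid_t      { uint32_t fset = 0, fno = 0; };     // fragments/%d/%d
>> >   struct fragment_t { uint32_t offset = 0, length = 0; fid_t fid; };
>> >
>> >   struct onode_t {
>> >     uint64_t size = 0;
>> >     std::map<uint64_t, fragment_t> data_map;  // object offset -> fragment
>> >   };
>> >
>> >   // Return the fragment covering 'off', or nullptr if it lands in a hole.
>> >   const fragment_t* find_fragment(const onode_t& o, uint64_t off) {
>> >     auto p = o.data_map.upper_bound(off);  // first fragment starting past off
>> >     if (p == o.data_map.begin())
>> >       return nullptr;                      // nothing starts at or before off
>> >     --p;
>> >     if (off < p->first + p->second.length)
>> >       return &p->second;                   // off falls inside this fragment
>> >     return nullptr;
>> >   }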
>> >
>> > We lean on the kvdb for our transactions.
>> >
>> > If we are creating new objects, we write data into a new file/fid,
>> > [aio_]fsync, and then commit the transaction.
>> >
>> > If we are doing an overwrite, we include a write-ahead log (wal)
>> > item in our transaction, and then apply it afterwards.  For example, a 4k
>> > overwrite would make whatever metadata changes are included, and a wal
>> > item that says "then overwrite this 4k in this fid with this data".  i.e.,
>> > the worst case is more or less what FileStore is doing now with its
>> > journal, except here we're using the kvdb (and its journal) for that.  On
>> > restart we can queue up and apply any unapplied wal items.
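>> >
>> > Roughly, and only as a sketch (a std::map stands in for the kvdb, the
>> > encodings are left abstract, and the key prefixes are made up), the
>> > overwrite path looks like:
>> >
>> >   #include <cstdint>
>> >   #include <map>
>> >   #include <string>
>> >
>> >   using KV = std::map<std::string, std::string>;   // pretend kvdb
>> >
>> >   // Step 1: commit the updated onode and the WAL record in one atomic txn.
>> >   // The WAL record says "then overwrite this range of this fid with these bytes".
>> >   void commit_overwrite(KV& db, const std::string& onode_key,
>> >                         const std::string& encoded_onode,
>> >                         uint64_t wal_seq, const std::string& encoded_wal_item) {
>> >     db["O/" + onode_key] = encoded_onode;
>> >     db["L/" + std::to_string(wal_seq)] = encoded_wal_item;
>> >   }
>> >
>> >   // Step 2: once the txn commits, apply the overwrite to the file, then
>> >   // drop the record.  On restart, anything still under "L/" is replayed.
>> >   void finish_wal_item(KV& db, uint64_t wal_seq) {
>> >     db.erase("L/" + std::to_string(wal_seq));
>> >   }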
>> >
>> > An alternative approach here that we discussed a bit yesterday would be to
>> > write the small overwrites into the kvdb adjacent to the onode.  Actually
>> > writing them back to the file could be deferred until later, maybe when
>> > there are many small writes to be done together.
>> >
>> > But right now the write behavior is very simple, and handles just 3 cases:
>> >
>> >         https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339
>> >
>> > 1. New object: create a new file and write there.
>> >
>> > 2. Append: append to an existing fid.  We store the size in the onode so
>> > we can be a bit sloppy and in the failure case (where we write some
>> > extra data to the file but don't commit the onode) just ignore any
>> > trailing file data.
>> >
>> > 3. Anything else: generate a WAL item.
>> >
>> > 4. Maybe later, for some small [over]writes, we instead put the new data
>> > next to the onode.
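>> >
>> > In other words (hypothetical helper, not the code at the link above), the
>> > per-write decision boils down to something like:
>> >
>> >   #include <cstdint>
>> >
>> >   enum class WriteMode { NewFile, Append, WalOverwrite };
>> >
>> >   // 'exists' and 'onode_size' come from the (possibly cached) onode.
>> >   WriteMode choose_write_mode(bool exists, uint64_t onode_size, uint64_t offset) {
>> >     if (!exists)
>> >       return WriteMode::NewFile;      // case 1: new object -> new file/fid
>> >     if (offset == onode_size)
>> >       return WriteMode::Append;       // case 2: strict append to existing fid
>> >     return WriteMode::WalOverwrite;   // case 3: everything else -> WAL item
>> >   }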
>> >
>> > There is no omap yet.  I think we should do basically what DBObjectMap did
>> > (with a layer of indirection to allow clone etc), but we need to rejigger
>> > it so that the initial pointer into that structure is embedded in the
>> > onode.  We may want to do some other optimization to avoid extra
>> > indirection in the common case.  Leaving this for later, though...
>> >
>> > We are designing for the case where the workload is already sharded across
>> > collections.  Each collection gets an in-memory Collection, which has its
>> > own RWLock and its own onode_map (SharedLRU cache).  A split will
>> > basically amount to registering the new collection in the kvdb and
>> > clearing the in-memory onode cache.
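>> >
>> > As a rough sketch of that sharding (std::shared_mutex and a plain map
>> > standing in for RWLock and the SharedLRU cache, not the real types):
>> >
>> >   #include <map>
>> >   #include <memory>
>> >   #include <mutex>
>> >   #include <shared_mutex>
>> >   #include <string>
>> >
>> >   struct Onode { /* size, attrs, data_map as in onode_t above */ };
>> >
>> >   struct Collection {
>> >     std::shared_mutex lock;                                   // per-collection RWLock
>> >     std::map<std::string, std::shared_ptr<Onode>> onode_map;  // simplified cache
>> >
>> >     std::shared_ptr<Onode> get_onode(const std::string& oid) {
>> >       std::shared_lock<std::shared_mutex> l(lock);            // readers share
>> >       auto p = onode_map.find(oid);
>> >       if (p == onode_map.end())
>> >         return nullptr;
>> >       return p->second;
>> >     }
>> >
>> >     void clear_cache() {                                      // e.g. after a split
>> >       std::unique_lock<std::shared_mutex> l(lock);            // writers exclusive
>> >       onode_map.clear();
>> >     }
>> >   };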
>> >
>> > There is a TransContext structure that is used to track the progress of a
>> > transaction.  It'll list which fd's need to get synced pre-commit, which
>> > onodes need to get written back in the transaction, and any WAL items to
>> > include and queue up after the transaction commits.  Right now the
>> > queue_transaction path does most of the work synchronously just to get
>> > things working.  Looking ahead I think what it needs to do is:
>> >
>> >  - assemble the transaction
>> >  - start any aio writes (we could use O_DIRECT here if the new hints
>> > include WONTNEED?)
>> >  - start any aio fsync's
>> >  - queue kvdb transaction
>> >  - fire onreadable[_sync] notifications (I suspect we'll want to do this
>> > unconditionally; maybe we avoid using them entirely?)
>> >
>> > On transaction commit,
>> >  - fire commit notifications
>> >  - queue WAL operations to a finisher
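>> >
>> > So a TransContext ends up carrying roughly this state (a simplified
>> > sketch with stand-in types, not the actual structure):
>> >
>> >   #include <cstdint>
>> >   #include <functional>
>> >   #include <string>
>> >   #include <vector>
>> >
>> >   struct TransContextSketch {
>> >     enum class State { Prepare, AioWait, KvQueued, KvCommitted, WalApplied };
>> >     State state = State::Prepare;
>> >
>> >     std::vector<int>         fds_to_sync;   // fds needing [aio_]fsync pre-commit
>> >     std::vector<std::string> dirty_onodes;  // onodes to write back in the kv txn
>> >     std::vector<std::string> wal_items;     // WAL records to apply after commit
>> >     std::function<void()>    on_readable;   // onreadable[_sync] notification
>> >     std::function<void()>    on_commit;     // commit notification
>> >   };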
>> >
>> > The WAL ops will be linked to the TransContext so that if you want to do a
>> > read on the onode you can block until it completes.  If we keep the
>> > (currently simple) locking then we can use the Collection rwlock to block
>> > new writes while we wait for previous ones to apply.  Or we can get more
>> > granular with the read vs write locks, but I'm not sure it'll be any use
>> > until we make major changes in the OSD (like dispatching parallel reads
>> > within a PG).
>> >
>> > Clone is annoying; if the FS doesn't support it natively (anything not
>> > btrfs) I think we should just do a sync read and then write for
>> > simplicity.
>> >
>> > A few other thoughts:
>> >
>> > - For a fast kvdb, we may want to do the transaction commit synchronously.
>> > For disk backends I think we'll want it async, though, to avoid blocking
>> > the caller.
>> >
>> > - The fid_t has an inode number stashed in it.  The idea is to use
>> > open_by_handle to avoid traversing the (shallow) directory and go straight
>> > to the inode.  On XFS this means we traverse the inode btree to verify it
>> > is in fact a valid ino, which isn't totally ideal but probably what we
>> > have to live with.  Note that open_by_handle will work on any other
>> > (NFS-exportable) filesystem as well so this is in no way XFS-specific.
>> > This isn't implemented yet, but when we do, we'll probably want to verify we
>> > got the right file by putting some id in an xattr; that way you could
>> > safely copy the whole thing to another filesystem and it could gracefully
>> > fall back to opening using the file names.
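>> >
>> > For the record, a Linux-only sketch of the open_by_handle path (needs
>> > CAP_DAC_READ_SEARCH; error handling and the xattr check are trimmed):
>> >
>> >   #ifndef _GNU_SOURCE
>> >   #define _GNU_SOURCE
>> >   #endif
>> >   #include <fcntl.h>
>> >   #include <stdlib.h>
>> >
>> >   // Capture a handle for fragments/<fset>/<fno> once; it can be stored
>> >   // alongside the fid and reused on later opens.
>> >   struct file_handle *capture_handle(int dirfd, const char *name) {
>> >     struct file_handle *fh =
>> >         (struct file_handle *)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
>> >     fh->handle_bytes = MAX_HANDLE_SZ;
>> >     int mount_id;
>> >     if (name_to_handle_at(dirfd, name, fh, &mount_id, 0) < 0) {
>> >       free(fh);
>> >       return NULL;
>> >     }
>> >     return fh;
>> >   }
>> >
>> >   // Later: go straight to the inode, skipping the directory lookup.
>> >   int open_fragment(int mount_fd, struct file_handle *fh) {
>> >     return open_by_handle_at(mount_fd, fh, O_RDWR);
>> >   }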
>> >
>> > - I think we could build a variation on this implementation on top of an
>> > NVMe device instead of a file system. It could pretty trivially lay out
>> > writes in the address space as a linear sweep across the virtual address
>> > space.  If the NVMe address space is big enough, maybe we could even avoid
>> > thinking about reusing addresses for deleted objects?  We'd just send a
>> > discard and then forget about it.  Not sure if the address space is really
>> > that big, though...  If not, we'd need to make a simple allocator
>> > (blah).
>> >
>> > sage
>> >
>> >
>> > * This follows in the Messenger's naming footsteps, which went like this:
>> > MPIMessenger, NewMessenger, NewerMessenger, SimpleMessenger (which ended
>> > up being anything but simple).
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat



-- 
Best Regards,

Wheat