Re: Async reads, sync writes, op thread model discussion

Matt Benjamin <mbenjamin@xxxxxxxxxx> · Fri, 14 Aug 2015 17:19:13 -0400 (EDT)

Hi,

I tend to agree with your comments regarding swapcontext/fibers.  I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions.  I also would like to see how Yehuda's system works before arguing generalities.

My intuition is, since the goal is more deterministic performance in a short horizion, you

a. need to prioritize transparency over novel abstractions
b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline

My .05.

Matt

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309

----- Original Message -----
> From: "Milosz Tanski" <milosz@xxxxxxxxx>
> To: "Haomai Wang" <haomaiwang@xxxxxxxxx>
> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>,
> ceph-devel@xxxxxxxxxxxxxxx
> Sent: Friday, August 14, 2015 4:56:26 PM
> Subject: Re: Async reads, sync writes, op thread model discussion
> 
> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
> > <ysadehwe@xxxxxxxxxx> wrote:
> >> Already mentioned it on irc, adding to ceph-devel for the sake of
> >> completeness. I did some infrastructure work for rgw and it seems (at
> >> least to me) that it could at least be partially useful here.
> >> Basically it's an async execution framework that utilizes coroutines.
> >> It's comprised of aio notification manager that can also be tied into
> >> coroutines execution. The coroutines themselves are stackless, they
> >> are implemented as state machines, but using some boost trickery to
> >> hide the details so they can be written very similar to blocking
> >> methods. Coroutines can also execute other coroutines and can be
> >> stacked, or can generate concurrent execution. It's still somewhat in
> >> flux, but I think it's mostly done and already useful at this point,
> >> so if there's anything you could use it might be a good idea to avoid
> >> effort duplication.
> >>
> >
> > coroutines like qemu is cool. The only thing I afraid is the
> > complicate of debug and it's really a big task :-(
> >
> > I agree with sage that this design is really a new implementation for
> > objectstore so that it's harmful to existing objectstore impl. I also
> > suffer the pain from sync read xattr, we may add a async read
> > interface to solove this?
> >
> > For context switch thing, now we have at least 3 cs for one op at osd
> > side. messenger -> op queue -> objectstore queue. I guess op queue ->
> > objectstore is easier to kick off just as sam said. We can make write
> > journal inline with queue_transaction, so the caller could directly
> > handle the transaction right now.
> 
> I would caution agains coroutines (fibers) esp. in a multi-threaded
> environment. Posix has officially obsoleted the swapcontext family of
> functions in 1003.1-2004 and removed it in 1003.1-2008. That's because
> they were notoriously non portable, and buggy. And yes you can use
> something like boost::context / boost::coroutine instead but they also
> have platform limitations. These implementations tend to abuse / turn
> of various platform scrutiny features (like the one for
> setjmp/longjmp). And on top of that many platforms don't consider
> alternative context so you end up with obscure bugs. I've debugged my
> fair share of bugs in Mordor coroutines with C++ exceptions, and errno
> variables (since errno is really a function on linux and it's output a
> pointer to threads errno is marked pure) if your coroutine migrates
> threads. And you need to migrate them because of blocking and uneven
> processor/thread distribution.
> 
> None of these are obstacles that can't be solved, but added together
> they become a pretty long term liability. So I think long and hard
> about it. Qemu doesn't have some of those issues because it's uses a
> single thread and a much simpler C ABI that it deals with.
> 
> An alternative to coroutines that goes a long way towards solving the
> callback spaghetti problem is futures/promises. I'm not talking of the
> very future model that exists in C++11 library but more along the
> lines that exist in other languages (like what's being done in
> Javascript today). There's a good implementation of it Folly (the
> facebook c++11 library). They have a very nice piece of documentation
> here to understand how they work and how they differ.
> 
> That future model is very handy when dealing with the callback control
> flow problem. You can chain a bunch of processing steps that requires
> some async action, return a future and continue so on and so forth.
> Also, it makes handling complex error cases easy by giving you a way
> to skip lots of processing steps strait to onError at the end of the
> chain.
> 
> Take a look at folly. Take a look at the expanded boost futures (they
> call this future continuations:
> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
> ). Also, building a cut down future framework just for Ceph (or
> reduced set folly one) might be another option.
> 
> Just an alternative.
> 
> >
> > Anyway, I think we need to do some changes for this field.
> >
> >> Yehuda
> >>
> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> >>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
> >>> to the actual interface I presented so much as the notion that the
> >>> next thing to do is restructure the OpWQ users as async state
> >>> machines.
> >>> -Sam
> >>>
> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
> >>>>> Currently, there are some deficiencies in how the OSD maps ops onto
> >>>>> threads:
> >>>>>
> >>>>> 1. Reads are always syncronous limiting the queue depth seen from the
> >>>>> device
> >>>>>    and therefore the possible parallelism.
> >>>>> 2. Writes are always asyncronous forcing even very fast writes to be
> >>>>> completed
> >>>>>    in a seperate thread.
> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation
> >>>>> forcing reads
> >>>>>    required to continue the operation to be syncronous.
> >>>>>
> >>>>> For spinning disks, this is mostly ok since they don't benefit as much
> >>>>> from
> >>>>> large read queues, and writes (filestore with journal) are too slow for
> >>>>> the
> >>>>> thread switches to make a big difference.  For very fast flash,
> >>>>> however, we
> >>>>> want the flexibility to allow the backend to perform writes
> >>>>> syncronously or
> >>>>> asyncronously when it makes sense, and to maintain a larger number of
> >>>>> outstanding reads than we have threads.  To that end, I suggest
> >>>>> changing the
> >>>>> ObjectStore interface to be somewhat polling based:
> >>>>>
> >>>>> /// Create new token
> >>>>> void *create_operation_token() = 0;
> >>>>> bool is_operation_complete(void *token) = 0;
> >>>>> bool is_operation_committed(void *token) = 0;
> >>>>> bool is_operation_applied(void *token) = 0;
> >>>>> void wait_for_committed(void *token) = 0;
> >>>>> void wait_for_applied(void *token) = 0;
> >>>>> void wait_for_complete(void *token) = 0;
> >>>>> /// Get result of operation
> >>>>> int get_result(void *token) = 0;
> >>>>> /// Must only be called once is_opearation_complete(token)
> >>>>> void reset_operation_token(void *token) = 0;
> >>>>> /// Must only be called once is_opearation_complete(token)
> >>>>> void detroy_operation_token(void *token) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed syncronously, token can be resused
> >>>>>  * without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result 0 if completed syncronously, -EAGAIN if async
> >>>>>  */
> >>>>> int queue_transaction(
> >>>>>   Transaction *t,
> >>>>>   OpSequencer *osr,
> >>>>>   void *token
> >>>>>   ) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed syncronously, token can be resused
> >>>>>  * without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
> >>>>>  */
> >>>>> int read(..., void *token) = 0;
> >>>>> ...
> >>>>>
> >>>>> The "token" concept here is opaque to allow the implementation some
> >>>>> flexibility.  Ideally, it would be nice to be able to include libaio
> >>>>> operation contexts directly.
> >>>>>
> >>>>> The main goal here is for the backend to have the freedom to complete
> >>>>> writes and reads asyncronously or syncronously as the sitation
> >>>>> warrants.
> >>>>> It also leaves the interface user in control of where the operation
> >>>>> completion is handled.  Each op thread can therefore handle its own
> >>>>> completions:
> >>>>>
> >>>>> struct InProgressOp {
> >>>>>   PGRef pg;
> >>>>>   ObjectStore::Token *token;
> >>>>>   OpContext *ctx;
> >>>>> };
> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
> >>>>
> >>>> Probably a deque<> since we'll be pushign new requests and slurping off
> >>>> completed ones?  Or, we can make token not completely opaque, so that it
> >>>> includes a boost::intrusive::list node and can be strung on a
> >>>> user-managed
> >>>> queue.
> >>>>
> >>>>> for (auto op : in_progress) {
> >>>>>   op.token = objectstore->create_operation_token();
> >>>>> }
> >>>>>
> >>>>> uint64_t next_to_start = 0;
> >>>>> uint64_t next_to_complete = 0;
> >>>>>
> >>>>> while (1) {
> >>>>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     objectstore->wait_for_complete(op.token);
> >>>>>   }
> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     if (objectstore->is_operation_complete(op.token)) {
> >>>>>       PGRef pg = op.pg;
> >>>>>       OpContext *ctx = op.ctx;
> >>>>>       op.pg = PGRef();
> >>>>>       op.ctx = nullptr;
> >>>>>       objectstore->reset_operation_token(op.token);
> >>>>>       if (pg->continue_op(
> >>>>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
> >>>>>               == -EAGAIN) {
> >>>>>         ++next_to_start;
> >>>>>         continue;
> >>>>>       }
> >>>>>     } else {
> >>>>>       break;
> >>>>>     }
> >>>>>   }
> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
> >>>>>   if (dq.second->do_op(
> >>>>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
> >>>>>           == -EAGAIN) {
> >>>>>     ++next_to_start;
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> A design like this would allow the op thread to move onto another task
> >>>>> if the
> >>>>> objectstore implementation wants to perform an async operation.  For
> >>>>> this
> >>>>> to work, there is some work to be done:
> >>>>>
> >>>>> 1. All current reads in the read and write paths (probably including
> >>>>> the attr
> >>>>>    reads in get_object_context and friends) need to be able to handle
> >>>>>    getting
> >>>>>    -EAGAIN from the objectstore.
> >>>>
> >>>> Can we leave the old read methods in place as blocking versions, and
> >>>> have
> >>>> them block on the token before returning?  That'll make the transition
> >>>> less painful.
> >>>>
> >>>>> 2. Writes and reads need to be able to handle having the pg lock
> >>>>> dropped
> >>>>>    during the operation.  This should be ok since the actual object
> >>>>>    information
> >>>>>    is protected by the RWState locks.
> >>>>
> >>>> All of the async write pieces already handle this (recheck PG state
> >>>> after
> >>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
> >>>> stage, probably with a flag indicating that validation can be skipped
> >>>> (since the lock hasn't been dropped)?
> >>>>
> >>>>> 3. OpContext needs to have enough information to pick up where the
> >>>>> operation
> >>>>>    left off.  This suggests that we should obtain all required
> >>>>>    ObjectContexts
> >>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
> >>>>
> >>>> Yeah...
> >>>>
> >>>>> 4. The object class interface will need to be replaced with a new
> >>>>> interface
> >>>>>    based on possibly async reads.  We can maintain compatibility with
> >>>>>    the
> >>>>>    current ones by launching a new thread to handle any message which
> >>>>>    happens
> >>>>>    to contain an old-style object class operation.
> >>>>
> >>>> Again, for now, wrappers would avoid this?
> >>>>
> >>>> s
> >>>>>
> >>>>> Most of this needs to happen to support object class operations on ec
> >>>>> pools
> >>>>> anyway.
> >>>>> -Sam
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>> in
> >>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
> 
> p: 646-253-9055
> e: milosz@xxxxxxxxx
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html