Based on distant past experience (in more of an embedded system) using co-routines, I'd vote for perhaps not implementing them right away, as they become complex to follow and debug. As I understood it, each worker would have an additional queue to pull from: new incoming work and re-dispatching of completed operations. In each case there's a limited set of states the operations can be in, since they are pre-empted at specific points. Also, having discrete queues for new work vs. pending operations allows balancing between the two, which may be necessary. Either way, I agree that deterministic operation (as well as a short per-IO code path for sync operations) would be the best outcome.

Thanks,
Stephen

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Matt Benjamin
Sent: Friday, August 14, 2015 2:19 PM
To: Milosz Tanski
Cc: Haomai Wang; Yehuda Sadeh-Weinraub; Samuel Just; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Async reads, sync writes, op thread model discussion

Hi,

I tend to agree with your comments regarding swapcontext/fibers. I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

I like the direction I interpreted Sam's design to be going in, and in particular, that it seems to allow for consistent handling of read and write transactions. I also would like to see how Yehuda's system works before arguing generalities.

My intuition is, since the goal is more deterministic performance in a short horizon, you

a. need to prioritize transparency over novel abstractions
b. need to build solid microbenchmarks that encapsulate small, then larger, pieces of the work pipeline

My .05.

Matt

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

----- Original Message -----
> From: "Milosz Tanski" <milosz@xxxxxxxxx>
> To: "Haomai Wang" <haomaiwang@xxxxxxxxx>
> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
> Sent: Friday, August 14, 2015 4:56:26 PM
> Subject: Re: Async reads, sync writes, op thread model discussion
>
> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub <ysadehwe@xxxxxxxxxx> wrote:
> >> Already mentioned it on irc, adding to ceph-devel for the sake of
> >> completeness. I did some infrastructure work for rgw and it seems
> >> (at least to me) that it could be at least partially useful here.
> >> Basically it's an async execution framework that utilizes
> >> coroutines. It's comprised of an aio notification manager that can
> >> also be tied into coroutine execution. The coroutines themselves
> >> are stackless; they are implemented as state machines, but use
> >> some boost trickery to hide the details so they can be written
> >> very similarly to blocking methods. Coroutines can also execute
> >> other coroutines and can be stacked, or can generate concurrent
> >> execution. It's still somewhat in flux, but I think it's mostly
> >> done and already useful at this point, so if there's anything you
> >> could use it might be a good idea to avoid effort duplication.
> >>
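To make the "stackless coroutines implemented as state machines" idea above concrete, here is a minimal sketch of the pattern; the names (Coroutine, AioManager, ReadObjectCR) and the -EAGAIN yield convention are illustrative assumptions, not the actual rgw framework API:

  #include <cerrno>

  struct AioManager;  // hypothetical: submits async I/O and re-queues the
                      // coroutine when the I/O completes

  struct Coroutine {
    virtual ~Coroutine() {}
    // Advance the state machine: returns 0 when finished,
    // -EAGAIN when it has yielded to wait for I/O.
    virtual int operate(AioManager *aio) = 0;
  };

  // Read an object's attrs, then its data, without pinning a thread.
  struct ReadObjectCR : public Coroutine {
    enum State { ReadAttrs, ReadData, Done };
    State state = ReadAttrs;

    int operate(AioManager *aio) override {
      switch (state) {
      case ReadAttrs:
        // submit_getattrs(aio, ...);  // hypothetical async submission
        state = ReadData;
        return -EAGAIN;               // yield; resumed on completion
      case ReadData:
        // submit_read(aio, ...);
        state = Done;
        return -EAGAIN;
      case Done:
        return 0;                     // result is now available
      }
      return 0;
    }
  };

The completion callback simply calls operate() again; because all state lives in the object rather than on a stack, nothing like swapcontext is needed to suspend and resume.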
> > Coroutines like qemu's are cool. The only thing I'm afraid of is
> > the added difficulty of debugging, and it's really a big task :-(
> >
> > I agree with Sage that this design is really a new implementation
> > of ObjectStore, so it's disruptive to the existing ObjectStore
> > implementations. I have also suffered the pain of synchronous xattr
> > reads; maybe we could add an async read interface to solve this?
> >
> > On context switches: we currently have at least three per op on the
> > OSD side: messenger -> op queue -> objectstore queue. I guess op
> > queue -> objectstore is easier to kick off, just as Sam said. We
> > could make the journal write inline with queue_transaction, so the
> > caller could handle the transaction directly.
>
> I would caution against coroutines (fibers), especially in a
> multi-threaded environment. POSIX officially obsoleted the
> swapcontext family of functions in 1003.1-2004 and removed them in
> 1003.1-2008, because they were notoriously non-portable and buggy.
> And yes, you can use something like boost::context /
> boost::coroutine instead, but they also have platform limitations.
> These implementations tend to abuse or turn off various platform
> hardening features (like those protecting setjmp/longjmp). And on
> top of that, many platforms don't account for alternative contexts,
> so you end up with obscure bugs. I've debugged my fair share of bugs
> in Mordor coroutines involving C++ exceptions and errno (on Linux,
> errno is really a function returning a pointer to the thread's
> errno, and that function is marked pure, so the compiler may cache
> its result) when a coroutine migrates threads. And you do need to
> migrate them, because of blocking and uneven processor/thread
> distribution.
>
> None of these are obstacles that can't be solved, but added together
> they become a pretty long-term liability. So I'd think long and hard
> about it. Qemu doesn't have some of those issues because it uses a
> single thread and deals with a much simpler C ABI.
>
> An alternative to coroutines that goes a long way towards solving
> the callback-spaghetti problem is futures/promises. I'm not talking
> about the bare future model in the C++11 standard library, but more
> along the lines of what exists in other languages (like what's being
> done in Javascript today). There's a good implementation in Folly
> (the Facebook C++11 library), with a very nice piece of
> documentation explaining how they work and how they differ.
>
> That future model is very handy for the callback control flow
> problem. You can chain a bunch of processing steps: each performs
> some async action, returns a future, and continues, and so on and so
> forth. Also, it makes handling complex error cases easy by giving
> you a way to skip lots of processing steps straight to the onError
> at the end of the chain.
>
> Take a look at folly. Take a look at the expanded boost futures
> (they call this feature continuations:
> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
> ). Also, building a cut-down future framework just for Ceph (or a
> reduced-set folly) might be another option.
>
> Just an alternative.
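A sketch of the chained style Milosz describes, written against folly's continuation API as it looked at the time (then/onError); the read_attrs/read_data helpers are hypothetical stand-ins for async ObjectStore calls, not real Ceph functions:

  #include <folly/futures/Future.h>
  #include <cerrno>
  #include <string>

  // Hypothetical async primitives, each returning immediately with a
  // future that is fulfilled when the I/O completes.
  folly::Future<std::string> read_attrs(const std::string &oid);
  folly::Future<std::string> read_data(const std::string &oid);

  folly::Future<int> handle_op(const std::string &oid) {
    return read_attrs(oid)
      .then([oid](std::string attrs) {
        // runs only once the attrs have arrived; returning a future
        // here chains the next async step
        return read_data(oid);
      })
      .then([](std::string data) {
        // apply the op and produce the final result
        return 0;
      })
      .onError([](const std::exception &e) {
        // a failure at any step above skips straight here
        return -EIO;
      });
  }

The appeal over raw callbacks is that each step reads top to bottom, and the error path is written once instead of in every callback.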
>
> >
> > Anyway, I think we need to make some changes in this area.
> >
> >> Yehuda
> >>
> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> >>> Yeah, I'm perfectly happy to have wrappers. I'm also not at all
> >>> tied to the actual interface I presented so much as the notion
> >>> that the next thing to do is restructure the OpWQ users as async
> >>> state machines.
> >>> -Sam
> >>>
> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
> >>>>> Currently, there are some deficiencies in how the OSD maps ops
> >>>>> onto threads:
> >>>>>
> >>>>> 1. Reads are always synchronous, limiting the queue depth seen
> >>>>>    from the device and therefore the possible parallelism.
> >>>>> 2. Writes are always asynchronous, forcing even very fast
> >>>>>    writes to be completed in a separate thread.
> >>>>> 3. do_op cannot surrender the thread/pg lock during an
> >>>>>    operation, forcing reads required to continue the operation
> >>>>>    to be synchronous.
> >>>>>
> >>>>> For spinning disks, this is mostly ok since they don't benefit
> >>>>> as much from large read queues, and writes (filestore with
> >>>>> journal) are too slow for the thread switches to make a big
> >>>>> difference. For very fast flash, however, we want the
> >>>>> flexibility to allow the backend to perform writes synchronously
> >>>>> or asynchronously when it makes sense, and to maintain a larger
> >>>>> number of outstanding reads than we have threads. To that end,
> >>>>> I suggest changing the ObjectStore interface to be somewhat
> >>>>> polling based:
> >>>>>
> >>>>> /// Create new token
> >>>>> void *create_operation_token() = 0;
> >>>>> bool is_operation_complete(void *token) = 0;
> >>>>> bool is_operation_committed(void *token) = 0;
> >>>>> bool is_operation_applied(void *token) = 0;
> >>>>> void wait_for_committed(void *token) = 0;
> >>>>> void wait_for_applied(void *token) = 0;
> >>>>> void wait_for_complete(void *token) = 0;
> >>>>> /// Get result of operation
> >>>>> int get_result(void *token) = 0;
> >>>>> /// Must only be called once is_operation_complete(token) returns true
> >>>>> void reset_operation_token(void *token) = 0;
> >>>>> /// Must only be called once is_operation_complete(token) returns true
> >>>>> void destroy_operation_token(void *token) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed synchronously, token can be
> >>>>>  * reused without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result 0 if completed synchronously, -EAGAIN if async
> >>>>>  */
> >>>>> int queue_transaction(
> >>>>>   Transaction *t,
> >>>>>   OpSequencer *osr,
> >>>>>   void *token
> >>>>>   ) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Read
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed synchronously, token can be
> >>>>>  * reused without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
> >>>>>  */
> >>>>> int read(..., void *token) = 0;
> >>>>> ...
> >>>>>
> >>>>> The "token" concept here is opaque to allow the implementation
> >>>>> some flexibility. Ideally, it would be nice to be able to
> >>>>> include libaio operation contexts directly.
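For illustration, a legacy blocking read can be expressed as a thin shim over this token interface (this is essentially the transition path Sage suggests below); a sketch, assuming only the methods proposed above, with the read arguments elided just as they are in the proposal:

  int read_blocking(... /* same args as read() */) {
    void *token = create_operation_token();
    int r = read(..., token);       // new-style, possibly async read
    if (r == -EAGAIN) {
      wait_for_complete(token);     // block this thread until done
      r = get_result(token);
    }
    destroy_operation_token(token);
    return r;
  }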
> >>>>>
> >>>>> The main goal here is for the backend to have the freedom to
> >>>>> complete writes and reads asynchronously or synchronously as
> >>>>> the situation warrants. It also leaves the interface user in
> >>>>> control of where the operation completion is handled. Each op
> >>>>> thread can therefore handle its own completions:
> >>>>>
> >>>>> struct InProgressOp {
> >>>>>   PGRef pg;
> >>>>>   ObjectStore::Token *token;
> >>>>>   OpContext *ctx;
> >>>>> };
> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
> >>>>
> >>>> Probably a deque<> since we'll be pushing new requests and
> >>>> slurping off completed ones? Or, we can make the token not
> >>>> completely opaque, so that it includes a boost::intrusive::list
> >>>> node and can be strung on a user-managed queue.
> >>>>
> >>>>> for (auto &op : in_progress) {
> >>>>>   op.token = objectstore->create_operation_token();
> >>>>> }
> >>>>>
> >>>>> uint64_t next_to_start = 0;
> >>>>> uint64_t next_to_complete = 0;
> >>>>>
> >>>>> while (1) {
> >>>>>   // ring full: block until the oldest op completes
> >>>>>   if (next_to_start - next_to_complete == MAX_IN_PROGRESS) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     objectstore->wait_for_complete(op.token);
> >>>>>   }
> >>>>>   // reap completions in order, continuing ops that need more I/O
> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     if (objectstore->is_operation_complete(op.token)) {
> >>>>>       PGRef pg = op.pg;
> >>>>>       OpContext *ctx = op.ctx;
> >>>>>       op.pg = PGRef();
> >>>>>       op.ctx = nullptr;
> >>>>>       objectstore->reset_operation_token(op.token);
> >>>>>       if (pg->continue_op(
> >>>>>             ctx, &in_progress[next_to_start % MAX_IN_PROGRESS])
> >>>>>           == -EAGAIN) {
> >>>>>         ++next_to_start;
> >>>>>         continue;
> >>>>>       }
> >>>>>     } else {
> >>>>>       break;
> >>>>>     }
> >>>>>   }
> >>>>>   // pull a new request off the queue and start it
> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
> >>>>>   if (dq.second->do_op(
> >>>>>         dq.first, &in_progress[next_to_start % MAX_IN_PROGRESS])
> >>>>>       == -EAGAIN) {
> >>>>>     ++next_to_start;
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> A design like this would allow the op thread to move onto
> >>>>> another task if the objectstore implementation wants to perform
> >>>>> an async operation. For this to work, there is some work to be
> >>>>> done:
> >>>>>
> >>>>> 1. All current reads in the read and write paths (probably
> >>>>>    including the attr reads in get_object_context and friends)
> >>>>>    need to be able to handle getting -EAGAIN from the
> >>>>>    objectstore.
> >>>>
> >>>> Can we leave the old read methods in place as blocking versions,
> >>>> and have them block on the token before returning? That'll make
> >>>> the transition less painful.
> >>>>
> >>>>> 2. Writes and reads need to be able to handle having the pg
> >>>>>    lock dropped during the operation. This should be ok since
> >>>>>    the actual object information is protected by the RWState
> >>>>>    locks.
> >>>>
> >>>> All of the async write pieces already handle this (recheck PG
> >>>> state after taking the lock). If they don't get -EAGAIN they'd
> >>>> just call the next stage, probably with a flag indicating that
> >>>> validation can be skipped (since the lock hasn't been dropped)?
> >>>>
> >>>>> 3. OpContext needs to have enough information to pick up where
> >>>>>    the operation left off. This suggests that we should obtain
> >>>>>    all required ObjectContexts at the beginning of the
> >>>>>    operation. Cache/tiering complicates this.
> >>>>
> >>>> Yeah...
> >>>>
> >>>>> 4. The object class interface will need to be replaced with a
> >>>>>    new interface based on possibly async reads. We can maintain
> >>>>>    compatibility with the current ones by launching a new
> >>>>>    thread to handle any message which happens to contain an
> >>>>>    old-style object class operation.
> >>>>
> >>>> Again, for now, wrappers would avoid this?
> >>>>
> >>>> s
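One shape the compatibility path for old-style object class operations could take, per Sam's suggestion of a dedicated thread per such message; all names here are hypothetical, not existing Ceph interfaces:

  #include <thread>

  void dispatch_op(OpRequestRef op, PGRef pg) {
    if (contains_legacy_cls_call(op)) {      // hypothetical predicate
      // old-style class methods may block on reads, so give the
      // message its own thread instead of an async op thread
      std::thread([op, pg] { pg->do_op_blocking(op); }).detach();
      return;
    }
    queue_for_op_threads(op, pg);            // normal async/polling path
  }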
> >>>>>
> >>>>> Most of this needs to happen to support object class
> >>>>> operations on ec pools anyway.
> >>>>> -Sam
> >
> > --
> > Best Regards,
> >
> > Wheat
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@xxxxxxxxx