On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@xxxxxxxxxx> wrote:
> Hi,
>
> I tend to agree with your comments regarding swapcontext/fibers. I am not
> much more enamored of jumping to new models (new! frameworks!) as a single
> jump, either.

I'm not suggesting those libraries/frameworks, just bringing up promises as
an alternative technique to coroutines. Dealing with spaghetti
evented/callback code gets old after doing it for 10+ years; then throw in
blocking IO on top of that. And FYI, data-flow promises go back in computer
science to the '80s.

Cheers,
- Milosz

> I like the way I interpreted Sam's design to be going, and in particular,
> that it seems to allow for consistent handling of read and write
> transactions. I also would like to see how Yehuda's system works before
> arguing generalities.
>
> My intuition is, since the goal is more deterministic performance in a
> short horizon, you
>
> a. need to prioritize transparency over novel abstractions
> b. need to build solid microbenchmarks that encapsulate small, then larger
>    pieces of the work pipeline
>
> My .05.
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
>
> ----- Original Message -----
>> From: "Milosz Tanski" <milosz@xxxxxxxxx>
>> To: "Haomai Wang" <haomaiwang@xxxxxxxxx>
>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
>> Sent: Friday, August 14, 2015 4:56:26 PM
>> Subject: Re: Async reads, sync writes, op thread model discussion
>>
>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>> > <ysadehwe@xxxxxxxxxx> wrote:
>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>> >> completeness. I did some infrastructure work for rgw and it seems (at
>> >> least to me) that it could at least be partially useful here.
>> >> Basically it's an async execution framework that utilizes coroutines.
>> >> It's comprised of an aio notification manager that can also be tied
>> >> into coroutine execution. The coroutines themselves are stackless;
>> >> they are implemented as state machines, but use some boost trickery
>> >> to hide the details so they can be written much like blocking
>> >> methods. Coroutines can also execute other coroutines and can be
>> >> stacked, or can generate concurrent execution. It's still somewhat in
>> >> flux, but I think it's mostly done and already useful at this point,
>> >> so if there's anything here you could use it might be a good idea to
>> >> avoid duplicating effort.
>> >>
>> >
>> > Coroutines like qemu's are cool. The only thing I'm afraid of is the
>> > complexity of debugging, and it's really a big task :-(
>> >
>> > I agree with Sage that this design is really a new implementation of
>> > ObjectStore, so it's disruptive to the existing ObjectStore
>> > implementations. I have also suffered the pain of synchronous xattr
>> > reads; maybe we could add an async read interface to solve this?
>> >
>> > As for context switches, we currently have at least 3 per op on the
>> > OSD side: messenger -> op queue -> objectstore queue. I guess op
>> > queue -> objectstore is the easier one to get rid of, just as Sam
>> > said. We can make the journal write inline with queue_transaction, so
>> > the caller can handle the transaction directly.
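For concreteness, the stackless, state-machine coroutines Yehuda describes
can be written in the style of boost::asio::coroutine, where macro trickery
turns blocking-looking code into a switch-based state machine. A minimal
sketch follows; boost::asio::coroutine is an assumption about the mechanism
(the actual rgw framework may differ), and the async launchers are
hypothetical stand-ins:

#include <boost/asio/coroutine.hpp>
#include <boost/asio/yield.hpp>
#include <iostream>

// Hypothetical stand-ins for submitting real async I/O.
void start_async_read()  { std::cout << "read submitted\n"; }
void start_async_write() { std::cout << "write submitted\n"; }

// Reads like blocking code, but compiles down to a switch-based state
// machine; each yield records the resume point and falls out of the block.
struct read_then_write : boost::asio::coroutine {
  void operator()(int async_result = 0) {
    reenter (this) {
      yield start_async_read();   // suspend until the read completes
      std::cout << "read done, result=" << async_result << "\n";
      yield start_async_write();  // suspend until the write completes
      std::cout << "write done\n";
    }
  }
};

int main() {
  read_then_write op;
  op();    // runs to the first yield
  op(42);  // "completion" of the read resumes the coroutine
  op();    // completion of the write resumes it again
}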
>> I would caution against coroutines (fibers), especially in a
>> multi-threaded environment. POSIX officially obsoleted the swapcontext
>> family of functions in 1003.1-2004 and removed them in 1003.1-2008,
>> because they were notoriously non-portable and buggy. And yes, you can
>> use something like boost::context / boost::coroutine instead, but they
>> also have platform limitations. These implementations tend to abuse or
>> turn off various platform sanity checks (like the ones around
>> setjmp/longjmp). On top of that, many platforms don't account for
>> alternative contexts, so you end up with obscure bugs. I've debugged my
>> fair share of bugs in Mordor coroutines involving C++ exceptions and
>> errno (errno is really a function call on Linux, and since its output
>> is a pointer to the thread's errno the function is marked pure) when a
>> coroutine migrates between threads. And you do need to migrate them,
>> because of blocking and uneven processor/thread distribution.
>>
>> None of these are obstacles that can't be solved, but added together
>> they become a pretty long-term liability. So I'd think long and hard
>> about it. Qemu avoids some of those issues because it uses a single
>> thread and deals with a much simpler C ABI.
>>
>> An alternative to coroutines that goes a long way towards solving the
>> callback-spaghetti problem is futures/promises. I'm not talking about
>> the bare future model in the C++11 standard library, but more along the
>> lines of what exists in other languages (like what's being done in
>> Javascript today). There's a good implementation in Folly (the Facebook
>> C++11 library), along with a very nice piece of documentation on how
>> they work and how they differ.
>>
>> That future model is very handy for the callback control-flow problem.
>> You can chain a bunch of processing steps: each performs some async
>> action, returns a future, and so on and so forth. It also makes complex
>> error cases easy to handle by giving you a way to skip lots of
>> processing steps straight to an onError handler at the end of the
>> chain.
>>
>> Take a look at folly. Take a look at the expanded boost futures (they
>> call this future continuations:
>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
>> ). Also, building a cut-down future framework just for Ceph (or a
>> reduced subset of the folly one) might be another option.
>>
>> Just an alternative.
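For concreteness, here is roughly what that chaining looks like against
the folly futures API of that era (then/onError; later folly versions
renamed these to thenValue/thenError). The readAttr stage below is a
hypothetical stand-in, not Ceph code:

#include <folly/futures/Future.h>
#include <iostream>
#include <stdexcept>

// Hypothetical async stage: read an object attribute, possibly failing.
folly::Future<std::string> readAttr(int obj_id) {
  if (obj_id < 0)
    return folly::makeFuture<std::string>(std::runtime_error("no such object"));
  return folly::makeFuture<std::string>("attr-value");
}

folly::Future<int> process(int obj_id) {
  return readAttr(obj_id)
    .then([](std::string attr) {
      // Next processing step; could itself return another Future.
      return static_cast<int>(attr.size());
    })
    .onError([](const std::runtime_error& e) {
      // Any error earlier in the chain skips straight here.
      std::cerr << "failed: " << e.what() << "\n";
      return -1;
    });
}

int main() {
  std::cout << process(1).get() << "\n";   // 10
  std::cout << process(-1).get() << "\n";  // -1 via onError
}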
>> >
>> > Anyway, I think we need to make some changes in this area.
>> >
>> >> Yehuda
>> >>
>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> Yeah, I'm perfectly happy to have wrappers. I'm also not at all tied
>> >>> to the actual interface I presented so much as the notion that the
>> >>> next thing to do is restructure the OpWQ users as async state
>> >>> machines.
>> >>> -Sam
>> >>>
>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>> >>>>> Currently, there are some deficiencies in how the OSD maps ops
>> >>>>> onto threads:
>> >>>>>
>> >>>>> 1. Reads are always synchronous, limiting the queue depth seen
>> >>>>>    from the device and therefore the possible parallelism.
>> >>>>> 2. Writes are always asynchronous, forcing even very fast writes
>> >>>>>    to be completed in a separate thread.
>> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation,
>> >>>>>    forcing reads required to continue the operation to be
>> >>>>>    synchronous.
>> >>>>>
>> >>>>> For spinning disks, this is mostly ok since they don't benefit as
>> >>>>> much from large read queues, and writes (filestore with journal)
>> >>>>> are too slow for the thread switches to make a big difference. For
>> >>>>> very fast flash, however, we want the flexibility to allow the
>> >>>>> backend to perform writes synchronously or asynchronously when it
>> >>>>> makes sense, and to maintain a larger number of outstanding reads
>> >>>>> than we have threads. To that end, I suggest changing the
>> >>>>> ObjectStore interface to be somewhat polling based:
>> >>>>>
>> >>>>> /// Create new token
>> >>>>> virtual void *create_operation_token() = 0;
>> >>>>> virtual bool is_operation_complete(void *token) = 0;
>> >>>>> virtual bool is_operation_committed(void *token) = 0;
>> >>>>> virtual bool is_operation_applied(void *token) = 0;
>> >>>>> virtual void wait_for_committed(void *token) = 0;
>> >>>>> virtual void wait_for_applied(void *token) = 0;
>> >>>>> virtual void wait_for_complete(void *token) = 0;
>> >>>>> /// Get result of operation
>> >>>>> virtual int get_result(void *token) = 0;
>> >>>>> /// Must only be called once is_operation_complete(token) is true
>> >>>>> virtual void reset_operation_token(void *token) = 0;
>> >>>>> /// Must only be called once is_operation_complete(token) is true
>> >>>>> virtual void destroy_operation_token(void *token) = 0;
>> >>>>>
>> >>>>> /**
>> >>>>>  * Queue a transaction
>> >>>>>  *
>> >>>>>  * token must be either fresh or reset since the last operation.
>> >>>>>  * If the operation is completed synchronously, the token can be
>> >>>>>  * reused without calling reset_operation_token.
>> >>>>>  *
>> >>>>>  * @return 0 if completed synchronously, -EAGAIN if async
>> >>>>>  */
>> >>>>> virtual int queue_transaction(
>> >>>>>   Transaction *t,
>> >>>>>   OpSequencer *osr,
>> >>>>>   void *token
>> >>>>>   ) = 0;
>> >>>>>
>> >>>>> /**
>> >>>>>  * Perform a read
>> >>>>>  *
>> >>>>>  * token must be either fresh or reset since the last operation.
>> >>>>>  * If the operation is completed synchronously, the token can be
>> >>>>>  * reused without calling reset_operation_token.
>> >>>>>  *
>> >>>>>  * @return -EAGAIN if async, 0 or -error otherwise.
>> >>>>>  */
>> >>>>> virtual int read(..., void *token) = 0;
>> >>>>> ...
>> >>>>>
>> >>>>> The "token" concept here is opaque to allow the implementation
>> >>>>> some flexibility. Ideally, it would be nice to be able to include
>> >>>>> libaio operation contexts directly.
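As a sketch of that last point: a backend could embed the libaio control
block directly in the token and drive completions with a non-blocking
io_getevents() poll. This is a hypothetical illustration of the idea (the
OpToken/AioBackend names are made up), not part of the proposal itself:

#include <libaio.h>
#include <cerrno>
#include <cstddef>

// Hypothetical token: the libaio iocb lives inside it, so no extra
// allocation or lookup is needed per in-flight operation.
struct OpToken {
  struct iocb cb;
  bool complete = false;
  int result = 0;
};

class AioBackend {
  io_context_t ioctx = 0;
public:
  AioBackend() { io_setup(128, &ioctx); }
  ~AioBackend() { io_destroy(ioctx); }

  // Submit a read; -EAGAIN tells the caller it will complete async.
  int read(int fd, void *buf, size_t len, off_t off, OpToken *tok) {
    io_prep_pread(&tok->cb, fd, buf, len, off);
    tok->cb.data = tok;                // recover the token at completion
    struct iocb *cbs[1] = { &tok->cb };
    return io_submit(ioctx, 1, cbs) == 1 ? -EAGAIN : -EIO;
  }

  // Non-blocking poll: drain finished events, then check this token.
  bool is_operation_complete(OpToken *tok) {
    struct io_event ev;
    struct timespec zero = {0, 0};
    while (io_getevents(ioctx, 0, 1, &ev, &zero) == 1) {
      OpToken *done = static_cast<OpToken *>(ev.data);
      done->result = static_cast<int>(ev.res);
      done->complete = true;
    }
    return tok->complete;
  }
};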
>> >>>>> The main goal here is for the backend to have the freedom to
>> >>>>> complete writes and reads asynchronously or synchronously as the
>> >>>>> situation warrants. It also leaves the interface user in control
>> >>>>> of where the operation completion is handled. Each op thread can
>> >>>>> therefore handle its own completions:
>> >>>>>
>> >>>>> struct InProgressOp {
>> >>>>>   PGRef pg;
>> >>>>>   void *token;
>> >>>>>   OpContext *ctx;
>> >>>>> };
>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>> >>>>
>> >>>> Probably a deque<> since we'll be pushing new requests and slurping
>> >>>> off completed ones? Or, we can make token not completely opaque, so
>> >>>> that it includes a boost::intrusive::list node and can be strung on
>> >>>> a user-managed queue.
>> >>>>
>> >>>>> for (auto &op : in_progress) {
>> >>>>>   op.token = objectstore->create_operation_token();
>> >>>>> }
>> >>>>>
>> >>>>> uint64_t next_to_start = 0;
>> >>>>> uint64_t next_to_complete = 0;
>> >>>>>
>> >>>>> while (1) {
>> >>>>>   if (next_to_start - next_to_complete == MAX_IN_PROGRESS) {
>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>> >>>>>     objectstore->wait_for_complete(op.token);
>> >>>>>   }
>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>> >>>>>       PGRef pg = op.pg;
>> >>>>>       OpContext *ctx = op.ctx;
>> >>>>>       op.pg = PGRef();
>> >>>>>       op.ctx = nullptr;
>> >>>>>       objectstore->reset_operation_token(op.token);
>> >>>>>       if (pg->continue_op(
>> >>>>>             ctx, &in_progress[next_to_start % MAX_IN_PROGRESS])
>> >>>>>           == -EAGAIN) {
>> >>>>>         ++next_to_start;
>> >>>>>         continue;
>> >>>>>       }
>> >>>>>     } else {
>> >>>>>       break;
>> >>>>>     }
>> >>>>>   }
>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>> >>>>>   if (dq.second->do_op(
>> >>>>>         dq.first, &in_progress[next_to_start % MAX_IN_PROGRESS])
>> >>>>>       == -EAGAIN) {
>> >>>>>     ++next_to_start;
>> >>>>>   }
>> >>>>> }
>> >>>>>
>> >>>>> A design like this would allow the op thread to move on to another
>> >>>>> task if the objectstore implementation wants to perform an async
>> >>>>> operation. For this to work, there is some work to be done:
>> >>>>>
>> >>>>> 1. All current reads in the read and write paths (probably
>> >>>>>    including the attr reads in get_object_context and friends)
>> >>>>>    need to be able to handle getting -EAGAIN from the objectstore.
>> >>>>
>> >>>> Can we leave the old read methods in place as blocking versions,
>> >>>> and have them block on the token before returning? That'll make the
>> >>>> transition less painful.
>> >>>>
>> >>>>> 2. Writes and reads need to be able to handle having the pg lock
>> >>>>>    dropped during the operation. This should be ok since the
>> >>>>>    actual object information is protected by the RWState locks.
>> >>>>
>> >>>> All of the async write pieces already handle this (recheck PG state
>> >>>> after taking the lock). If they don't get -EAGAIN they'd just call
>> >>>> the next stage, probably with a flag indicating that validation can
>> >>>> be skipped (since the lock hasn't been dropped)?
>> >>>>
>> >>>>> 3. OpContext needs to have enough information to pick up where the
>> >>>>>    operation left off. This suggests that we should obtain all
>> >>>>>    required ObjectContexts at the beginning of the operation.
>> >>>>>    Cache/Tiering complicates this.
>> >>>>
>> >>>> Yeah...
>> >>>>
>> >>>>> 4. The object class interface will need to be replaced with a new
>> >>>>>    interface based on possibly-async reads. We can maintain
>> >>>>>    compatibility with the current ones by launching a new thread
>> >>>>>    to handle any message which happens to contain an old-style
>> >>>>>    object class operation.
>> >>>>
>> >>>> Again, for now, wrappers would avoid this?
>> >>>>
>> >>>> s
>> >>>>>
>> >>>>> Most of this needs to happen to support object class operations on
>> >>>>> ec pools anyway.
>> >>>>> -Sam
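Sage's blocking-wrapper suggestion above could be a thin shim over the
token interface, roughly along these lines. This is a sketch against the
proposed API; since the proposal elides read()'s parameter list, the
parameters here are hypothetical placeholders:

// Old-style synchronous read implemented on top of the token interface.
// ObjectStore, coll_t, ghobject_t and bufferlist are the existing Ceph
// types; the exact read() signature is a placeholder.
int read_blocking(ObjectStore *os, coll_t cid, const ghobject_t &oid,
                  uint64_t off, size_t len, bufferlist *bl) {
  void *token = os->create_operation_token();
  int r = os->read(cid, oid, off, len, bl, token);  // may finish inline
  if (r == -EAGAIN) {               // backend chose the async path
    os->wait_for_complete(token);   // block this thread on the token
    r = os->get_result(token);
  }
  os->destroy_operation_token(token);
  return r;
}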
>> >
>> > --
>> > Best Regards,
>> >
>> > Wheat

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx