Based on distant past experience (in more of an embedded system) using co-routines, I'd vote for perhaps not implementing them right away, as they become complex to follow and debug. As I understood it, each worker would have an additional queue to pull from: new incoming work and re-dispatching of completed operations. In each case there's a limited set of states the operations can be in, since they are pre-empted at specific points. Also, having discrete queues for new work vs. pending operations allows balancing between the two, which may be necessary. Either way, I agree that deterministic operation (as well as a short per-IO code path for sync operations) would be the best outcome.

Thanks,
Stephen

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Matt Benjamin
Sent: Friday, August 14, 2015 2:19 PM
To: Milosz Tanski
Cc: Haomai Wang; Yehuda Sadeh-Weinraub; Samuel Just; Sage Weil; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Async reads, sync writes, op thread model discussion

Hi,

I tend to agree with your comments regarding swapcontext/fibers. I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either.

I like the direction I interpreted Sam's design to be going in, and in particular, that it seems to allow for consistent handling of read and write transactions. I also would like to see how Yehuda's system works before arguing generalities.

My intuition is, since the goal is more deterministic performance in a short horizon, you

a. need to prioritize transparency over novel abstractions
b. need to build solid microbenchmarks that encapsulate small, then larger, pieces of the work pipeline

My .05.

Matt

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

----- Original Message -----
> From: "Milosz Tanski" <milosz@xxxxxxxxx>
> To: "Haomai Wang" <haomaiwang@xxxxxxxxx>
> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
> Sent: Friday, August 14, 2015 4:56:26 PM
> Subject: Re: Async reads, sync writes, op thread model discussion
>
> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub <ysadehwe@xxxxxxxxxx> wrote:
> >> Already mentioned it on irc, adding to ceph-devel for the sake of
> >> completeness. I did some infrastructure work for rgw and it seems
> >> (at least to me) that it could be at least partially useful here.
> >> Basically it's an async execution framework that utilizes
> >> coroutines. It's comprised of an aio notification manager that can
> >> also be tied into coroutine execution. The coroutines themselves
> >> are stackless; they are implemented as state machines, but use
> >> some boost trickery to hide the details so they can be written
> >> very similarly to blocking methods. Coroutines can also execute
> >> other coroutines and can be stacked, or can generate concurrent
> >> execution. It's still somewhat in flux, but I think it's mostly
> >> done and already useful at this point, so if there's anything you
> >> could use it might be a good idea to avoid effort duplication.
> >>
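To make the "stackless coroutines implemented as state machines" idea above concrete, here is a minimal sketch of the pattern; the names (Coroutine, AioManager, ReadObjectCR) and the -EAGAIN yield convention are illustrative assumptions, not the actual rgw framework API:

  #include <cerrno>

  struct AioManager;  // hypothetical: submits async I/O and re-queues the
                      // coroutine when the I/O completes

  struct Coroutine {
    virtual ~Coroutine() {}
    // Advance the state machine: returns 0 when finished,
    // -EAGAIN when it has yielded to wait for I/O.
    virtual int operate(AioManager *aio) = 0;
  };

  // Read an object's attrs, then its data, without pinning a thread.
  struct ReadObjectCR : public Coroutine {
    enum State { ReadAttrs, ReadData, Done };
    State state = ReadAttrs;

    int operate(AioManager *aio) override {
      switch (state) {
      case ReadAttrs:
        // submit_getattrs(aio, ...);  // hypothetical async submission
        state = ReadData;
        return -EAGAIN;               // yield; resumed on completion
      case ReadData:
        // submit_read(aio, ...);
        state = Done;
        return -EAGAIN;
      case Done:
        return 0;                     // result is now available
      }
      return 0;
    }
  };

The completion callback simply calls operate() again; because all state lives in the object rather than on a stack, nothing like swapcontext is needed to suspend and resume.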
> > Coroutines like qemu's are cool. The only thing I'm afraid of is
> > the added difficulty of debugging, and it's really a big task :-(
> >
> > I agree with Sage that this design is really a new implementation
> > of ObjectStore, so it's disruptive to the existing ObjectStore
> > implementations. I have also suffered the pain of synchronous xattr
> > reads; maybe we could add an async read interface to solve this?
> >
> > On context switches: we currently have at least three per op on the
> > OSD side: messenger -> op queue -> objectstore queue. I guess op
> > queue -> objectstore is easier to kick off, just as Sam said. We
> > could make the journal write inline with queue_transaction, so the
> > caller could handle the transaction directly.
>
> I would caution against coroutines (fibers), especially in a
> multi-threaded environment. POSIX officially obsoleted the
> swapcontext family of functions in 1003.1-2004 and removed them in
> 1003.1-2008, because they were notoriously non-portable and buggy.
> And yes, you can use something like boost::context /
> boost::coroutine instead, but they also have platform limitations.
> These implementations tend to abuse or turn off various platform
> hardening features (like those protecting setjmp/longjmp). And on
> top of that, many platforms don't account for alternative contexts,
> so you end up with obscure bugs. I've debugged my fair share of bugs
> in Mordor coroutines involving C++ exceptions and errno (on Linux,
> errno is really a function returning a pointer to the thread's
> errno, and that function is marked pure, so the compiler may cache
> its result) when a coroutine migrates threads. And you do need to
> migrate them, because of blocking and uneven processor/thread
> distribution.
>
> None of these are obstacles that can't be solved, but added together
> they become a pretty long-term liability. So I'd think long and hard
> about it. Qemu doesn't have some of those issues because it uses a
> single thread and deals with a much simpler C ABI.
>
> An alternative to coroutines that goes a long way towards solving
> the callback-spaghetti problem is futures/promises. I'm not talking
> about the bare future model in the C++11 standard library, but more
> along the lines of what exists in other languages (like what's being
> done in Javascript today). There's a good implementation in Folly
> (the Facebook C++11 library), with a very nice piece of
> documentation explaining how they work and how they differ.
>
> That future model is very handy for the callback control flow
> problem. You can chain a bunch of processing steps: each performs
> some async action, returns a future, and continues, and so on and so
> forth. Also, it makes handling complex error cases easy by giving
> you a way to skip lots of processing steps straight to the onError
> at the end of the chain.
>
> Take a look at folly. Take a look at the expanded boost futures
> (they call this feature continuations:
> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
> ). Also, building a cut-down future framework just for Ceph (or a
> reduced-set folly) might be another option.
>
> Just an alternative.
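A sketch of the chained style Milosz describes, written against folly's continuation API as it looked at the time (then/onError); the read_attrs/read_data helpers are hypothetical stand-ins for async ObjectStore calls, not real Ceph functions:

  #include <folly/futures/Future.h>
  #include <cerrno>
  #include <string>

  // Hypothetical async primitives, each returning immediately with a
  // future that is fulfilled when the I/O completes.
  folly::Future<std::string> read_attrs(const std::string &oid);
  folly::Future<std::string> read_data(const std::string &oid);

  folly::Future<int> handle_op(const std::string &oid) {
    return read_attrs(oid)
      .then([oid](std::string attrs) {
        // runs only once the attrs have arrived; returning a future
        // here chains the next async step
        return read_data(oid);
      })
      .then([](std::string data) {
        // apply the op and produce the final result
        return 0;
      })
      .onError([](const std::exception &e) {
        // a failure at any step above skips straight here
        return -EIO;
      });
  }

The appeal over raw callbacks is that each step reads top to bottom, and the error path is written once instead of in every callback.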
>
> >
> > Anyway, I think we need to make some changes in this area.
> >
> >> Yehuda
> >>
> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> >>> Yeah, I'm perfectly happy to have wrappers. I'm also not at all
> >>> tied to the actual interface I presented so much as the notion
> >>> that the next thing to do is restructure the OpWQ users as async
> >>> state machines.
> >>> -Sam
> >>>
> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
> >>>>> Currently, there are some deficiencies in how the OSD maps ops
> >>>>> onto threads:
> >>>>>
> >>>>> 1. Reads are always synchronous, limiting the queue depth seen
> >>>>>    from the device and therefore the possible parallelism.
> >>>>> 2. Writes are always asynchronous, forcing even very fast
> >>>>>    writes to be completed in a separate thread.
> >>>>> 3. do_op cannot surrender the thread/pg lock during an
> >>>>>    operation, forcing reads required to continue the operation
> >>>>>    to be synchronous.
> >>>>>
> >>>>> For spinning disks, this is mostly ok since they don't benefit
> >>>>> as much from large read queues, and writes (filestore with
> >>>>> journal) are too slow for the thread switches to make a big
> >>>>> difference. For very fast flash, however, we want the
> >>>>> flexibility to allow the backend to perform writes synchronously
> >>>>> or asynchronously when it makes sense, and to maintain a larger
> >>>>> number of outstanding reads than we have threads. To that end,
> >>>>> I suggest changing the ObjectStore interface to be somewhat
> >>>>> polling based:
> >>>>>
> >>>>> /// Create new token
> >>>>> void *create_operation_token() = 0;
> >>>>> bool is_operation_complete(void *token) = 0;
> >>>>> bool is_operation_committed(void *token) = 0;
> >>>>> bool is_operation_applied(void *token) = 0;
> >>>>> void wait_for_committed(void *token) = 0;
> >>>>> void wait_for_applied(void *token) = 0;
> >>>>> void wait_for_complete(void *token) = 0;
> >>>>> /// Get result of operation
> >>>>> int get_result(void *token) = 0;
> >>>>> /// Must only be called once is_operation_complete(token) returns true
> >>>>> void reset_operation_token(void *token) = 0;
> >>>>> /// Must only be called once is_operation_complete(token) returns true
> >>>>> void destroy_operation_token(void *token) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Queue a transaction
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed synchronously, token can be
> >>>>>  * reused without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result 0 if completed synchronously, -EAGAIN if async
> >>>>>  */
> >>>>> int queue_transaction(
> >>>>>   Transaction *t,
> >>>>>   OpSequencer *osr,
> >>>>>   void *token
> >>>>>   ) = 0;
> >>>>>
> >>>>> /**
> >>>>>  * Read
> >>>>>  *
> >>>>>  * token must be either fresh or reset since the last operation.
> >>>>>  * If the operation is completed synchronously, token can be
> >>>>>  * reused without calling reset_operation_token.
> >>>>>  *
> >>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
> >>>>>  */
> >>>>> int read(..., void *token) = 0;
> >>>>> ...
> >>>>>
> >>>>> The "token" concept here is opaque to allow the implementation
> >>>>> some flexibility. Ideally, it would be nice to be able to
> >>>>> include libaio operation contexts directly.
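For illustration, a legacy blocking read can be expressed as a thin shim over this token interface (this is essentially the transition path Sage suggests below); a sketch, assuming only the methods proposed above, with the read arguments elided just as they are in the proposal:

  int read_blocking(... /* same args as read() */) {
    void *token = create_operation_token();
    int r = read(..., token);       // new-style, possibly async read
    if (r == -EAGAIN) {
      wait_for_complete(token);     // block this thread until done
      r = get_result(token);
    }
    destroy_operation_token(token);
    return r;
  }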
> >>>>>
> >>>>> The main goal here is for the backend to have the freedom to
> >>>>> complete writes and reads asynchronously or synchronously as
> >>>>> the situation warrants. It also leaves the interface user in
> >>>>> control of where the operation completion is handled. Each op
> >>>>> thread can therefore handle its own completions:
> >>>>>
> >>>>> struct InProgressOp {
> >>>>>   PGRef pg;
> >>>>>   ObjectStore::Token *token;
> >>>>>   OpContext *ctx;
> >>>>> };
> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
> >>>>
> >>>> Probably a deque<> since we'll be pushing new requests and
> >>>> slurping off completed ones? Or, we can make the token not
> >>>> completely opaque, so that it includes a boost::intrusive::list
> >>>> node and can be strung on a user-managed queue.
> >>>>
> >>>>> for (auto &op : in_progress) {
> >>>>>   op.token = objectstore->create_operation_token();
> >>>>> }
> >>>>>
> >>>>> uint64_t next_to_start = 0;
> >>>>> uint64_t next_to_complete = 0;
> >>>>>
> >>>>> while (1) {
> >>>>>   // ring full: block until the oldest op completes
> >>>>>   if (next_to_start - next_to_complete == MAX_IN_PROGRESS) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     objectstore->wait_for_complete(op.token);
> >>>>>   }
> >>>>>   // reap completions in order, continuing ops that need more I/O
> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
> >>>>>     if (objectstore->is_operation_complete(op.token)) {
> >>>>>       PGRef pg = op.pg;
> >>>>>       OpContext *ctx = op.ctx;
> >>>>>       op.pg = PGRef();
> >>>>>       op.ctx = nullptr;
> >>>>>       objectstore->reset_operation_token(op.token);
> >>>>>       if (pg->continue_op(
> >>>>>             ctx, &in_progress[next_to_start % MAX_IN_PROGRESS])
> >>>>>           == -EAGAIN) {
> >>>>>         ++next_to_start;
> >>>>>         continue;
> >>>>>       }
> >>>>>     } else {
> >>>>>       break;
> >>>>>     }
> >>>>>   }
> >>>>>   // pull a new request off the queue and start it
> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
> >>>>>   if (dq.second->do_op(
> >>>>>         dq.first, &in_progress[next_to_start % MAX_IN_PROGRESS])
> >>>>>       == -EAGAIN) {
> >>>>>     ++next_to_start;
> >>>>>   }
> >>>>> }
> >>>>>
> >>>>> A design like this would allow the op thread to move onto
> >>>>> another task if the objectstore implementation wants to perform
> >>>>> an async operation. For this to work, there is some work to be
> >>>>> done:
> >>>>>
> >>>>> 1. All current reads in the read and write paths (probably
> >>>>>    including the attr reads in get_object_context and friends)
> >>>>>    need to be able to handle getting -EAGAIN from the
> >>>>>    objectstore.
> >>>>
> >>>> Can we leave the old read methods in place as blocking versions,
> >>>> and have them block on the token before returning? That'll make
> >>>> the transition less painful.
> >>>>
> >>>>> 2. Writes and reads need to be able to handle having the pg
> >>>>>    lock dropped during the operation. This should be ok since
> >>>>>    the actual object information is protected by the RWState
> >>>>>    locks.
> >>>>
> >>>> All of the async write pieces already handle this (recheck PG
> >>>> state after taking the lock). If they don't get -EAGAIN they'd
> >>>> just call the next stage, probably with a flag indicating that
> >>>> validation can be skipped (since the lock hasn't been dropped)?
> >>>>
> >>>>> 3. OpContext needs to have enough information to pick up where
> >>>>>    the operation left off. This suggests that we should obtain
> >>>>>    all required ObjectContexts at the beginning of the
> >>>>>    operation. Cache/tiering complicates this.
> >>>>
> >>>> Yeah...
> >>>>
> >>>>> 4. The object class interface will need to be replaced with a
> >>>>>    new interface based on possibly async reads. We can maintain
> >>>>>    compatibility with the current ones by launching a new
> >>>>>    thread to handle any message which happens to contain an
> >>>>>    old-style object class operation.
> >>>>
> >>>> Again, for now, wrappers would avoid this?
> >>>>
> >>>> s
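One shape the compatibility path for old-style object class operations could take, per Sam's suggestion of a dedicated thread per such message; all names here are hypothetical, not existing Ceph interfaces:

  #include <thread>

  void dispatch_op(OpRequestRef op, PGRef pg) {
    if (contains_legacy_cls_call(op)) {      // hypothetical predicate
      // old-style class methods may block on reads, so give the
      // message its own thread instead of an async op thread
      std::thread([op, pg] { pg->do_op_blocking(op); }).detach();
      return;
    }
    queue_for_op_threads(op, pg);            // normal async/polling path
  }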
> >>>>>
> >>>>> Most of this needs to happen to support object class
> >>>>> operations on ec pools anyway.
> >>>>> -Sam
> >
> > --
> > Best Regards,
> >
> > Wheat
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@xxxxxxxxx