Hi, I tend to agree with your comments regarding swapcontext/fibers. I am not much more enamored of jumping to new models (new! frameworks!) as a single jump, either. I like the way I interpreted Sam's design to be going, and in particular, that it seems to allow for consistent handling of read, write transactions. I also would like to see how Yehuda's system works before arguing generalities. My intuition is, since the goal is more deterministic performance in a short horizion, you a. need to prioritize transparency over novel abstractions b. need to build solid microbenchmarks that encapsulate small, then larger pieces of the work pipeline My .05. Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 ----- Original Message ----- > From: "Milosz Tanski" <milosz@xxxxxxxxx> > To: "Haomai Wang" <haomaiwang@xxxxxxxxx> > Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, > ceph-devel@xxxxxxxxxxxxxxx > Sent: Friday, August 14, 2015 4:56:26 PM > Subject: Re: Async reads, sync writes, op thread model discussion > > On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote: > > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub > > <ysadehwe@xxxxxxxxxx> wrote: > >> Already mentioned it on irc, adding to ceph-devel for the sake of > >> completeness. I did some infrastructure work for rgw and it seems (at > >> least to me) that it could at least be partially useful here. > >> Basically it's an async execution framework that utilizes coroutines. > >> It's comprised of aio notification manager that can also be tied into > >> coroutines execution. The coroutines themselves are stackless, they > >> are implemented as state machines, but using some boost trickery to > >> hide the details so they can be written very similar to blocking > >> methods. Coroutines can also execute other coroutines and can be > >> stacked, or can generate concurrent execution. It's still somewhat in > >> flux, but I think it's mostly done and already useful at this point, > >> so if there's anything you could use it might be a good idea to avoid > >> effort duplication. > >> > > > > coroutines like qemu is cool. The only thing I afraid is the > > complicate of debug and it's really a big task :-( > > > > I agree with sage that this design is really a new implementation for > > objectstore so that it's harmful to existing objectstore impl. I also > > suffer the pain from sync read xattr, we may add a async read > > interface to solove this? > > > > For context switch thing, now we have at least 3 cs for one op at osd > > side. messenger -> op queue -> objectstore queue. I guess op queue -> > > objectstore is easier to kick off just as sam said. We can make write > > journal inline with queue_transaction, so the caller could directly > > handle the transaction right now. > > I would caution agains coroutines (fibers) esp. in a multi-threaded > environment. Posix has officially obsoleted the swapcontext family of > functions in 1003.1-2004 and removed it in 1003.1-2008. That's because > they were notoriously non portable, and buggy. And yes you can use > something like boost::context / boost::coroutine instead but they also > have platform limitations. These implementations tend to abuse / turn > of various platform scrutiny features (like the one for > setjmp/longjmp). And on top of that many platforms don't consider > alternative context so you end up with obscure bugs. I've debugged my > fair share of bugs in Mordor coroutines with C++ exceptions, and errno > variables (since errno is really a function on linux and it's output a > pointer to threads errno is marked pure) if your coroutine migrates > threads. And you need to migrate them because of blocking and uneven > processor/thread distribution. > > None of these are obstacles that can't be solved, but added together > they become a pretty long term liability. So I think long and hard > about it. Qemu doesn't have some of those issues because it's uses a > single thread and a much simpler C ABI that it deals with. > > An alternative to coroutines that goes a long way towards solving the > callback spaghetti problem is futures/promises. I'm not talking of the > very future model that exists in C++11 library but more along the > lines that exist in other languages (like what's being done in > Javascript today). There's a good implementation of it Folly (the > facebook c++11 library). They have a very nice piece of documentation > here to understand how they work and how they differ. > > That future model is very handy when dealing with the callback control > flow problem. You can chain a bunch of processing steps that requires > some async action, return a future and continue so on and so forth. > Also, it makes handling complex error cases easy by giving you a way > to skip lots of processing steps strait to onError at the end of the > chain. > > Take a look at folly. Take a look at the expanded boost futures (they > call this future continuations: > http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then > ). Also, building a cut down future framework just for Ceph (or > reduced set folly one) might be another option. > > Just an alternative. > > > > > Anyway, I think we need to do some changes for this field. > > > >> Yehuda > >> > >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote: > >>> Yeah, I'm perfectly happy to have wrappers. I'm also not at all tied > >>> to the actual interface I presented so much as the notion that the > >>> next thing to do is restructure the OpWQ users as async state > >>> machines. > >>> -Sam > >>> > >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > >>>> On Tue, 11 Aug 2015, Samuel Just wrote: > >>>>> Currently, there are some deficiencies in how the OSD maps ops onto > >>>>> threads: > >>>>> > >>>>> 1. Reads are always syncronous limiting the queue depth seen from the > >>>>> device > >>>>> and therefore the possible parallelism. > >>>>> 2. Writes are always asyncronous forcing even very fast writes to be > >>>>> completed > >>>>> in a seperate thread. > >>>>> 3. do_op cannot surrender the thread/pg lock during an operation > >>>>> forcing reads > >>>>> required to continue the operation to be syncronous. > >>>>> > >>>>> For spinning disks, this is mostly ok since they don't benefit as much > >>>>> from > >>>>> large read queues, and writes (filestore with journal) are too slow for > >>>>> the > >>>>> thread switches to make a big difference. For very fast flash, > >>>>> however, we > >>>>> want the flexibility to allow the backend to perform writes > >>>>> syncronously or > >>>>> asyncronously when it makes sense, and to maintain a larger number of > >>>>> outstanding reads than we have threads. To that end, I suggest > >>>>> changing the > >>>>> ObjectStore interface to be somewhat polling based: > >>>>> > >>>>> /// Create new token > >>>>> void *create_operation_token() = 0; > >>>>> bool is_operation_complete(void *token) = 0; > >>>>> bool is_operation_committed(void *token) = 0; > >>>>> bool is_operation_applied(void *token) = 0; > >>>>> void wait_for_committed(void *token) = 0; > >>>>> void wait_for_applied(void *token) = 0; > >>>>> void wait_for_complete(void *token) = 0; > >>>>> /// Get result of operation > >>>>> int get_result(void *token) = 0; > >>>>> /// Must only be called once is_opearation_complete(token) > >>>>> void reset_operation_token(void *token) = 0; > >>>>> /// Must only be called once is_opearation_complete(token) > >>>>> void detroy_operation_token(void *token) = 0; > >>>>> > >>>>> /** > >>>>> * Queue a transaction > >>>>> * > >>>>> * token must be either fresh or reset since the last operation. > >>>>> * If the operation is completed syncronously, token can be resused > >>>>> * without calling reset_operation_token. > >>>>> * > >>>>> * @result 0 if completed syncronously, -EAGAIN if async > >>>>> */ > >>>>> int queue_transaction( > >>>>> Transaction *t, > >>>>> OpSequencer *osr, > >>>>> void *token > >>>>> ) = 0; > >>>>> > >>>>> /** > >>>>> * Queue a transaction > >>>>> * > >>>>> * token must be either fresh or reset since the last operation. > >>>>> * If the operation is completed syncronously, token can be resused > >>>>> * without calling reset_operation_token. > >>>>> * > >>>>> * @result -EAGAIN if async, 0 or -error otherwise. > >>>>> */ > >>>>> int read(..., void *token) = 0; > >>>>> ... > >>>>> > >>>>> The "token" concept here is opaque to allow the implementation some > >>>>> flexibility. Ideally, it would be nice to be able to include libaio > >>>>> operation contexts directly. > >>>>> > >>>>> The main goal here is for the backend to have the freedom to complete > >>>>> writes and reads asyncronously or syncronously as the sitation > >>>>> warrants. > >>>>> It also leaves the interface user in control of where the operation > >>>>> completion is handled. Each op thread can therefore handle its own > >>>>> completions: > >>>>> > >>>>> struct InProgressOp { > >>>>> PGRef pg; > >>>>> ObjectStore::Token *token; > >>>>> OpContext *ctx; > >>>>> }; > >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS); > >>>> > >>>> Probably a deque<> since we'll be pushign new requests and slurping off > >>>> completed ones? Or, we can make token not completely opaque, so that it > >>>> includes a boost::intrusive::list node and can be strung on a > >>>> user-managed > >>>> queue. > >>>> > >>>>> for (auto op : in_progress) { > >>>>> op.token = objectstore->create_operation_token(); > >>>>> } > >>>>> > >>>>> uint64_t next_to_start = 0; > >>>>> uint64_t next_to_complete = 0; > >>>>> > >>>>> while (1) { > >>>>> if (next_to_complete - next_to_start == MAX_IN_PROGRESS) { > >>>>> InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS]; > >>>>> objectstore->wait_for_complete(op.token); > >>>>> } > >>>>> for (; next_to_complete < next_to_start; ++next_to_complete) { > >>>>> InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS]; > >>>>> if (objectstore->is_operation_complete(op.token)) { > >>>>> PGRef pg = op.pg; > >>>>> OpContext *ctx = op.ctx; > >>>>> op.pg = PGRef(); > >>>>> op.ctx = nullptr; > >>>>> objectstore->reset_operation_token(op.token); > >>>>> if (pg->continue_op( > >>>>> ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS]) > >>>>> == -EAGAIN) { > >>>>> ++next_to_start; > >>>>> continue; > >>>>> } > >>>>> } else { > >>>>> break; > >>>>> } > >>>>> } > >>>>> pair<OpRequestRef, PGRef> dq = // get new request from queue; > >>>>> if (dq.second->do_op( > >>>>> dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS]) > >>>>> == -EAGAIN) { > >>>>> ++next_to_start; > >>>>> } > >>>>> } > >>>>> > >>>>> A design like this would allow the op thread to move onto another task > >>>>> if the > >>>>> objectstore implementation wants to perform an async operation. For > >>>>> this > >>>>> to work, there is some work to be done: > >>>>> > >>>>> 1. All current reads in the read and write paths (probably including > >>>>> the attr > >>>>> reads in get_object_context and friends) need to be able to handle > >>>>> getting > >>>>> -EAGAIN from the objectstore. > >>>> > >>>> Can we leave the old read methods in place as blocking versions, and > >>>> have > >>>> them block on the token before returning? That'll make the transition > >>>> less painful. > >>>> > >>>>> 2. Writes and reads need to be able to handle having the pg lock > >>>>> dropped > >>>>> during the operation. This should be ok since the actual object > >>>>> information > >>>>> is protected by the RWState locks. > >>>> > >>>> All of the async write pieces already handle this (recheck PG state > >>>> after > >>>> taking the lock). If they don't get -EAGAIN they'd just call the next > >>>> stage, probably with a flag indicating that validation can be skipped > >>>> (since the lock hasn't been dropped)? > >>>> > >>>>> 3. OpContext needs to have enough information to pick up where the > >>>>> operation > >>>>> left off. This suggests that we should obtain all required > >>>>> ObjectContexts > >>>>> at the beginning of the operation. Cache/Tiering complicates this. > >>>> > >>>> Yeah... > >>>> > >>>>> 4. The object class interface will need to be replaced with a new > >>>>> interface > >>>>> based on possibly async reads. We can maintain compatibility with > >>>>> the > >>>>> current ones by launching a new thread to handle any message which > >>>>> happens > >>>>> to contain an old-style object class operation. > >>>> > >>>> Again, for now, wrappers would avoid this? > >>>> > >>>> s > >>>>> > >>>>> Most of this needs to happen to support object class operations on ec > >>>>> pools > >>>>> anyway. > >>>>> -Sam > >>>>> -- > >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" > >>>>> in > >>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>>>> > >>>>> > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>> the body of a message to majordomo@xxxxxxxxxxxxxxx > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> -- > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >> the body of a message to majordomo@xxxxxxxxxxxxxxx > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > > > -- > > Best Regards, > > > > Wheat > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > -- > Milosz Tanski > CTO > 16 East 34th Street, 15th floor > New York, NY 10016 > > p: 646-253-9055 > e: milosz@xxxxxxxxx > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html