On Fri, Aug 14, 2015 at 5:19 PM, Matt Benjamin <mbenjamin@xxxxxxxxxx> wrote:
> Hi,
>
> I tend to agree with your comments regarding swapcontext/fibers. I am not
> much more enamored of jumping to new models (new! frameworks!) as a single
> jump, either.

I'm not suggesting those libraries/frameworks, just bringing up promises as
an alternative technique to coroutines. Dealing with spaghetti
evented/callback code gets old after doing it for 10+ years; then throw in
blocking IO on top of that. And FYI, data-flow promises go back in computer
science to the '80s.

Cheers,
- Milosz

> I like the way I interpreted Sam's design to be going, and in particular,
> that it seems to allow for consistent handling of read and write
> transactions. I also would like to see how Yehuda's system works before
> arguing generalities.
>
> My intuition is, since the goal is more deterministic performance in a
> short horizon, you
>
> a. need to prioritize transparency over novel abstractions
> b. need to build solid microbenchmarks that encapsulate small, then larger
>    pieces of the work pipeline
>
> My .05.
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
>
> ----- Original Message -----
>> From: "Milosz Tanski" <milosz@xxxxxxxxx>
>> To: "Haomai Wang" <haomaiwang@xxxxxxxxx>
>> Cc: "Yehuda Sadeh-Weinraub" <ysadehwe@xxxxxxxxxx>, "Samuel Just" <sjust@xxxxxxxxxx>, "Sage Weil" <sage@xxxxxxxxxxxx>, ceph-devel@xxxxxxxxxxxxxxx
>> Sent: Friday, August 14, 2015 4:56:26 PM
>> Subject: Re: Async reads, sync writes, op thread model discussion
>>
>> On Tue, Aug 11, 2015 at 10:50 PM, Haomai Wang <haomaiwang@xxxxxxxxx> wrote:
>> > On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
>> > <ysadehwe@xxxxxxxxxx> wrote:
>> >> Already mentioned it on irc, adding to ceph-devel for the sake of
>> >> completeness. I did some infrastructure work for rgw and it seems (at
>> >> least to me) that it could at least be partially useful here.
>> >> Basically it's an async execution framework that utilizes coroutines.
>> >> It's comprised of an aio notification manager that can also be tied
>> >> into coroutine execution. The coroutines themselves are stackless;
>> >> they are implemented as state machines, but use some boost trickery
>> >> to hide the details so they can be written much like blocking
>> >> methods. Coroutines can also execute other coroutines and can be
>> >> stacked, or can generate concurrent execution. It's still somewhat in
>> >> flux, but I think it's mostly done and already useful at this point,
>> >> so if there's anything here you could use it might be a good idea to
>> >> avoid duplicating effort.
>> >>
>> >
>> > Coroutines like qemu's are cool. The only thing I'm afraid of is the
>> > complexity of debugging, and it's really a big task :-(
>> >
>> > I agree with Sage that this design is really a new implementation of
>> > ObjectStore, so it's disruptive to the existing ObjectStore
>> > implementations. I have also suffered the pain of synchronous xattr
>> > reads; maybe we could add an async read interface to solve this?
>> >
>> > As for context switches, we currently have at least 3 per op on the
>> > OSD side: messenger -> op queue -> objectstore queue. I guess op
>> > queue -> objectstore is the easier one to get rid of, just as Sam
>> > said. We can make the journal write inline with queue_transaction, so
>> > the caller can handle the transaction directly.
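For concreteness, the stackless, state-machine coroutines Yehuda describes
can be written in the style of boost::asio::coroutine, where macro trickery
turns blocking-looking code into a switch-based state machine. A minimal
sketch follows; boost::asio::coroutine is an assumption about the mechanism
(the actual rgw framework may differ), and the async launchers are
hypothetical stand-ins:

#include <boost/asio/coroutine.hpp>
#include <boost/asio/yield.hpp>
#include <iostream>

// Hypothetical stand-ins for submitting real async I/O.
void start_async_read()  { std::cout << "read submitted\n"; }
void start_async_write() { std::cout << "write submitted\n"; }

// Reads like blocking code, but compiles down to a switch-based state
// machine; each yield records the resume point and falls out of the block.
struct read_then_write : boost::asio::coroutine {
  void operator()(int async_result = 0) {
    reenter (this) {
      yield start_async_read();   // suspend until the read completes
      std::cout << "read done, result=" << async_result << "\n";
      yield start_async_write();  // suspend until the write completes
      std::cout << "write done\n";
    }
  }
};

int main() {
  read_then_write op;
  op();    // runs to the first yield
  op(42);  // "completion" of the read resumes the coroutine
  op();    // completion of the write resumes it again
}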
>> I would caution against coroutines (fibers), especially in a
>> multi-threaded environment. POSIX officially obsoleted the swapcontext
>> family of functions in 1003.1-2004 and removed them in 1003.1-2008,
>> because they were notoriously non-portable and buggy. And yes, you can
>> use something like boost::context / boost::coroutine instead, but they
>> also have platform limitations. These implementations tend to abuse or
>> turn off various platform sanity checks (like the ones around
>> setjmp/longjmp). On top of that, many platforms don't account for
>> alternative contexts, so you end up with obscure bugs. I've debugged my
>> fair share of bugs in Mordor coroutines involving C++ exceptions and
>> errno (errno is really a function call on Linux, and since its output
>> is a pointer to the thread's errno the function is marked pure) when a
>> coroutine migrates between threads. And you do need to migrate them,
>> because of blocking and uneven processor/thread distribution.
>>
>> None of these are obstacles that can't be solved, but added together
>> they become a pretty long-term liability. So I'd think long and hard
>> about it. Qemu avoids some of those issues because it uses a single
>> thread and deals with a much simpler C ABI.
>>
>> An alternative to coroutines that goes a long way towards solving the
>> callback-spaghetti problem is futures/promises. I'm not talking about
>> the bare future model in the C++11 standard library, but more along the
>> lines of what exists in other languages (like what's being done in
>> Javascript today). There's a good implementation in Folly (the Facebook
>> C++11 library), along with a very nice piece of documentation on how
>> they work and how they differ.
>>
>> That future model is very handy for the callback control-flow problem.
>> You can chain a bunch of processing steps: each performs some async
>> action, returns a future, and so on and so forth. It also makes complex
>> error cases easy to handle by giving you a way to skip lots of
>> processing steps straight to an onError handler at the end of the
>> chain.
>>
>> Take a look at folly. Take a look at the expanded boost futures (they
>> call this future continuations:
>> http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.futures.then
>> ). Also, building a cut-down future framework just for Ceph (or a
>> reduced subset of the folly one) might be another option.
>>
>> Just an alternative.
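For concreteness, here is roughly what that chaining looks like against
the folly futures API of that era (then/onError; later folly versions
renamed these to thenValue/thenError). The readAttr stage below is a
hypothetical stand-in, not Ceph code:

#include <folly/futures/Future.h>
#include <iostream>
#include <stdexcept>

// Hypothetical async stage: read an object attribute, possibly failing.
folly::Future<std::string> readAttr(int obj_id) {
  if (obj_id < 0)
    return folly::makeFuture<std::string>(std::runtime_error("no such object"));
  return folly::makeFuture<std::string>("attr-value");
}

folly::Future<int> process(int obj_id) {
  return readAttr(obj_id)
    .then([](std::string attr) {
      // Next processing step; could itself return another Future.
      return static_cast<int>(attr.size());
    })
    .onError([](const std::runtime_error& e) {
      // Any error earlier in the chain skips straight here.
      std::cerr << "failed: " << e.what() << "\n";
      return -1;
    });
}

int main() {
  std::cout << process(1).get() << "\n";   // 10
  std::cout << process(-1).get() << "\n";  // -1 via onError
}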
>> >
>> > Anyway, I think we need to make some changes in this area.
>> >
>> >> Yehuda
>> >>
>> >> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> >>> Yeah, I'm perfectly happy to have wrappers. I'm also not at all tied
>> >>> to the actual interface I presented so much as the notion that the
>> >>> next thing to do is restructure the OpWQ users as async state
>> >>> machines.
>> >>> -Sam
>> >>>
>> >>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> >>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>> >>>>> Currently, there are some deficiencies in how the OSD maps ops
>> >>>>> onto threads:
>> >>>>>
>> >>>>> 1. Reads are always synchronous, limiting the queue depth seen
>> >>>>>    from the device and therefore the possible parallelism.
>> >>>>> 2. Writes are always asynchronous, forcing even very fast writes
>> >>>>>    to be completed in a separate thread.
>> >>>>> 3. do_op cannot surrender the thread/pg lock during an operation,
>> >>>>>    forcing reads required to continue the operation to be
>> >>>>>    synchronous.
>> >>>>>
>> >>>>> For spinning disks, this is mostly ok since they don't benefit as
>> >>>>> much from large read queues, and writes (filestore with journal)
>> >>>>> are too slow for the thread switches to make a big difference. For
>> >>>>> very fast flash, however, we want the flexibility to allow the
>> >>>>> backend to perform writes synchronously or asynchronously when it
>> >>>>> makes sense, and to maintain a larger number of outstanding reads
>> >>>>> than we have threads. To that end, I suggest changing the
>> >>>>> ObjectStore interface to be somewhat polling based:
>> >>>>>
>> >>>>> /// Create new token
>> >>>>> virtual void *create_operation_token() = 0;
>> >>>>> virtual bool is_operation_complete(void *token) = 0;
>> >>>>> virtual bool is_operation_committed(void *token) = 0;
>> >>>>> virtual bool is_operation_applied(void *token) = 0;
>> >>>>> virtual void wait_for_committed(void *token) = 0;
>> >>>>> virtual void wait_for_applied(void *token) = 0;
>> >>>>> virtual void wait_for_complete(void *token) = 0;
>> >>>>> /// Get result of operation
>> >>>>> virtual int get_result(void *token) = 0;
>> >>>>> /// Must only be called once is_operation_complete(token) is true
>> >>>>> virtual void reset_operation_token(void *token) = 0;
>> >>>>> /// Must only be called once is_operation_complete(token) is true
>> >>>>> virtual void destroy_operation_token(void *token) = 0;
>> >>>>>
>> >>>>> /**
>> >>>>>  * Queue a transaction
>> >>>>>  *
>> >>>>>  * token must be either fresh or reset since the last operation.
>> >>>>>  * If the operation is completed synchronously, the token can be
>> >>>>>  * reused without calling reset_operation_token.
>> >>>>>  *
>> >>>>>  * @return 0 if completed synchronously, -EAGAIN if async
>> >>>>>  */
>> >>>>> virtual int queue_transaction(
>> >>>>>   Transaction *t,
>> >>>>>   OpSequencer *osr,
>> >>>>>   void *token
>> >>>>>   ) = 0;
>> >>>>>
>> >>>>> /**
>> >>>>>  * Perform a read
>> >>>>>  *
>> >>>>>  * token must be either fresh or reset since the last operation.
>> >>>>>  * If the operation is completed synchronously, the token can be
>> >>>>>  * reused without calling reset_operation_token.
>> >>>>>  *
>> >>>>>  * @return -EAGAIN if async, 0 or -error otherwise.
>> >>>>>  */
>> >>>>> virtual int read(..., void *token) = 0;
>> >>>>> ...
>> >>>>>
>> >>>>> The "token" concept here is opaque to allow the implementation
>> >>>>> some flexibility. Ideally, it would be nice to be able to include
>> >>>>> libaio operation contexts directly.
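As a sketch of that last point: a backend could embed the libaio control
block directly in the token and drive completions with a non-blocking
io_getevents() poll. This is a hypothetical illustration of the idea (the
OpToken/AioBackend names are made up), not part of the proposal itself:

#include <libaio.h>
#include <cerrno>
#include <cstddef>

// Hypothetical token: the libaio iocb lives inside it, so no extra
// allocation or lookup is needed per in-flight operation.
struct OpToken {
  struct iocb cb;
  bool complete = false;
  int result = 0;
};

class AioBackend {
  io_context_t ioctx = 0;
public:
  AioBackend() { io_setup(128, &ioctx); }
  ~AioBackend() { io_destroy(ioctx); }

  // Submit a read; -EAGAIN tells the caller it will complete async.
  int read(int fd, void *buf, size_t len, off_t off, OpToken *tok) {
    io_prep_pread(&tok->cb, fd, buf, len, off);
    tok->cb.data = tok;                // recover the token at completion
    struct iocb *cbs[1] = { &tok->cb };
    return io_submit(ioctx, 1, cbs) == 1 ? -EAGAIN : -EIO;
  }

  // Non-blocking poll: drain finished events, then check this token.
  bool is_operation_complete(OpToken *tok) {
    struct io_event ev;
    struct timespec zero = {0, 0};
    while (io_getevents(ioctx, 0, 1, &ev, &zero) == 1) {
      OpToken *done = static_cast<OpToken *>(ev.data);
      done->result = static_cast<int>(ev.res);
      done->complete = true;
    }
    return tok->complete;
  }
};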
>> >>>>> The main goal here is for the backend to have the freedom to
>> >>>>> complete writes and reads asynchronously or synchronously as the
>> >>>>> situation warrants. It also leaves the interface user in control
>> >>>>> of where the operation completion is handled. Each op thread can
>> >>>>> therefore handle its own completions:
>> >>>>>
>> >>>>> struct InProgressOp {
>> >>>>>   PGRef pg;
>> >>>>>   void *token;
>> >>>>>   OpContext *ctx;
>> >>>>> };
>> >>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>> >>>>
>> >>>> Probably a deque<> since we'll be pushing new requests and slurping
>> >>>> off completed ones? Or, we can make token not completely opaque, so
>> >>>> that it includes a boost::intrusive::list node and can be strung on
>> >>>> a user-managed queue.
>> >>>>
>> >>>>> for (auto &op : in_progress) {
>> >>>>>   op.token = objectstore->create_operation_token();
>> >>>>> }
>> >>>>>
>> >>>>> uint64_t next_to_start = 0;
>> >>>>> uint64_t next_to_complete = 0;
>> >>>>>
>> >>>>> while (1) {
>> >>>>>   if (next_to_start - next_to_complete == MAX_IN_PROGRESS) {
>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>> >>>>>     objectstore->wait_for_complete(op.token);
>> >>>>>   }
>> >>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>> >>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>> >>>>>     if (objectstore->is_operation_complete(op.token)) {
>> >>>>>       PGRef pg = op.pg;
>> >>>>>       OpContext *ctx = op.ctx;
>> >>>>>       op.pg = PGRef();
>> >>>>>       op.ctx = nullptr;
>> >>>>>       objectstore->reset_operation_token(op.token);
>> >>>>>       if (pg->continue_op(
>> >>>>>             ctx, &in_progress[next_to_start % MAX_IN_PROGRESS])
>> >>>>>           == -EAGAIN) {
>> >>>>>         ++next_to_start;
>> >>>>>         continue;
>> >>>>>       }
>> >>>>>     } else {
>> >>>>>       break;
>> >>>>>     }
>> >>>>>   }
>> >>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>> >>>>>   if (dq.second->do_op(
>> >>>>>         dq.first, &in_progress[next_to_start % MAX_IN_PROGRESS])
>> >>>>>       == -EAGAIN) {
>> >>>>>     ++next_to_start;
>> >>>>>   }
>> >>>>> }
>> >>>>>
>> >>>>> A design like this would allow the op thread to move on to another
>> >>>>> task if the objectstore implementation wants to perform an async
>> >>>>> operation. For this to work, there is some work to be done:
>> >>>>>
>> >>>>> 1. All current reads in the read and write paths (probably
>> >>>>>    including the attr reads in get_object_context and friends)
>> >>>>>    need to be able to handle getting -EAGAIN from the objectstore.
>> >>>>
>> >>>> Can we leave the old read methods in place as blocking versions,
>> >>>> and have them block on the token before returning? That'll make the
>> >>>> transition less painful.
>> >>>>
>> >>>>> 2. Writes and reads need to be able to handle having the pg lock
>> >>>>>    dropped during the operation. This should be ok since the
>> >>>>>    actual object information is protected by the RWState locks.
>> >>>>
>> >>>> All of the async write pieces already handle this (recheck PG state
>> >>>> after taking the lock). If they don't get -EAGAIN they'd just call
>> >>>> the next stage, probably with a flag indicating that validation can
>> >>>> be skipped (since the lock hasn't been dropped)?
>> >>>>
>> >>>>> 3. OpContext needs to have enough information to pick up where the
>> >>>>>    operation left off. This suggests that we should obtain all
>> >>>>>    required ObjectContexts at the beginning of the operation.
>> >>>>>    Cache/Tiering complicates this.
>> >>>>
>> >>>> Yeah...
>> >>>>
>> >>>>> 4. The object class interface will need to be replaced with a new
>> >>>>>    interface based on possibly-async reads. We can maintain
>> >>>>>    compatibility with the current ones by launching a new thread
>> >>>>>    to handle any message which happens to contain an old-style
>> >>>>>    object class operation.
>> >>>>
>> >>>> Again, for now, wrappers would avoid this?
>> >>>>
>> >>>> s
>> >>>>>
>> >>>>> Most of this needs to happen to support object class operations on
>> >>>>> ec pools anyway.
>> >>>>> -Sam
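Sage's blocking-wrapper suggestion above could be a thin shim over the
token interface, roughly along these lines. This is a sketch against the
proposed API; since the proposal elides read()'s parameter list, the
parameters here are hypothetical placeholders:

// Old-style synchronous read implemented on top of the token interface.
// ObjectStore, coll_t, ghobject_t and bufferlist are the existing Ceph
// types; the exact read() signature is a placeholder.
int read_blocking(ObjectStore *os, coll_t cid, const ghobject_t &oid,
                  uint64_t off, size_t len, bufferlist *bl) {
  void *token = os->create_operation_token();
  int r = os->read(cid, oid, off, len, bl, token);  // may finish inline
  if (r == -EAGAIN) {               // backend chose the async path
    os->wait_for_complete(token);   // block this thread on the token
    r = os->get_result(token);
  }
  os->destroy_operation_token(token);
  return r;
}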
>> >
>> > --
>> > Best Regards,
>> >
>> > Wheat

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx