Re: Async reads, sync writes, op thread model discussion

Haomai Wang <haomaiwang@xxxxxxxxx> · Wed, 12 Aug 2015 10:50:20 +0800

On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
<ysadehwe@xxxxxxxxxx> wrote:
> Already mentioned it on irc, adding to ceph-devel for the sake of
> completeness. I did some infrastructure work for rgw and it seems (at
> least to me) that it could at least be partially useful here.
> Basically it's an async execution framework that utilizes coroutines.
> It comprises an aio notification manager that can also be tied into
> coroutine execution. The coroutines themselves are stackless; they
> are implemented as state machines, but using some boost trickery to
> hide the details so they can be written very similar to blocking
> methods. Coroutines can also execute other coroutines and can be
> stacked, or can generate concurrent execution. It's still somewhat in
> flux, but I think it's mostly done and already useful at this point,
> so if there's anything you could use it might be a good idea to avoid
> effort duplication.
>

Coroutines like QEMU's are cool. The only thing I'm afraid of is the
complexity of debugging, and it's really a big task :-(
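
For anyone who has not seen the stackless style Yehuda describes, here
is a minimal sketch of a coroutine written as a state machine. It is
not the rgw framework and uses no boost; SquareGen, resume() and
drain() are names made up for illustration. The switch/case trick is
the same idea that such macros usually hide.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration; none of these names come from the rgw code.
struct SquareGen {
  int state = 0;  // where to resume: 0 = start, 1 = inside the loop, 2 = done
  int i = 0;      // loop variable kept in the object, since there is no stack
  int limit;

  explicit SquareGen(int n) : limit(n) {}

  // Produces the next square into *out and returns true, or returns false
  // once the sequence is exhausted.
  bool resume(int *out) {
    switch (state) {
    case 0:
      for (i = 0; i < limit; ++i) {
        *out = i * i;
        state = 1;
        return true;  // "yield": give the thread back to the caller
    case 1:;          // the next resume() call jumps back in here
      }
      state = 2;
    }
    return false;
  }
};

// Drive the coroutine to completion and collect what it yields.
std::vector<int> drain(SquareGen &g) {
  std::vector<int> out;
  int v;
  while (g.resume(&v)) out.push_back(v);
  return out;
}
```

The debugging worry is visible even here: a backtrace taken between two
resume() calls shows only the caller, because all of the coroutine's
progress lives in `state` and `i`.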

I agree with Sage that this design is really a new implementation of
ObjectStore, so it would be harmful to the existing ObjectStore
implementations. I also suffer from the synchronous xattr reads; maybe
we could add an async read interface to solve this?
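
To make the question concrete, here is a rough sketch of what I mean,
with the old blocking call kept as a wrapper over the async one (as
Sage suggests in the quoted text). Everything here is hypothetical:
ToyStore, ReadToken, getattr_async() and poll() are made-up names, and
poll() merely stands in for reaping a device completion.

```cpp
#include <cassert>
#include <cerrno>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <utility>

// All names are hypothetical; this is a toy, not the ObjectStore interface.
struct ReadToken {
  bool complete = false;
  int result = 0;
  std::string value;
};

class ToyStore {
  std::map<std::string, std::string> xattrs_;
  std::set<std::string> cached_;  // attrs that can be answered without I/O
  std::deque<std::pair<std::string, ReadToken *>> pending_;

  // Fill in the token from the backing map and mark it complete.
  void finish(const std::string &name, ReadToken *t) {
    auto it = xattrs_.find(name);
    t->result = (it == xattrs_.end()) ? -ENOENT : 0;
    if (it != xattrs_.end()) t->value = it->second;
    t->complete = true;
  }

public:
  void setattr(const std::string &name, const std::string &v, bool cached) {
    xattrs_[name] = v;
    if (cached) cached_.insert(name);
  }

  // Returns 0 or -error if answered immediately, -EAGAIN if queued.
  int getattr_async(const std::string &name, ReadToken *t) {
    if (cached_.count(name)) {
      finish(name, t);
      return t->result;
    }
    pending_.emplace_back(name, t);
    return -EAGAIN;
  }

  // Stand-in for reaping one device completion; false if nothing is queued.
  bool poll() {
    if (pending_.empty()) return false;
    auto p = pending_.front();
    pending_.pop_front();
    finish(p.first, p.second);
    return true;
  }

  // Old-style blocking call, kept as a wrapper over the async one.
  int getattr(const std::string &name, std::string *out) {
    ReadToken t;
    int r = getattr_async(name, &t);
    if (r == -EAGAIN) {
      while (!t.complete && poll()) {}
      r = t.result;
    }
    if (r == 0) *out = t.value;
    return r;
  }
};
```

Existing callers keep using getattr() unchanged, and only the hot paths
need to learn about -EAGAIN.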

As for context switches, we currently have at least 3 per op on the OSD
side: messenger -> op queue -> objectstore queue. I guess the op queue ->
objectstore switch is the easier one to eliminate, just as Sam said. We
can write the journal inline in queue_transaction, so the caller can
handle the transaction directly, right away.
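
A minimal sketch of the inline idea, under assumptions of my own: the
caller's thread appends to the journal itself when the entry is small
and nothing is already deferred, and otherwise leaves it to a
background writer. ToyJournalStore, the size threshold and flush_one()
are hypothetical and are not how FileStore is written today; the
0 / -EAGAIN convention follows Sam's proposal.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical toy; the journal is just an in-memory vector.
class ToyJournalStore {
  std::vector<std::string> journal_;  // stands in for the on-disk journal
  std::deque<std::string> deferred_;  // entries left for a background writer
  std::size_t inline_limit_;          // assumed cutoff for inline writes

public:
  explicit ToyJournalStore(std::size_t inline_limit)
      : inline_limit_(inline_limit) {}

  // Returns 0 if the entry was journaled in the caller's thread (no
  // context switch), -EAGAIN if it was handed to the background writer.
  int queue_transaction(const std::string &txn) {
    // Only write inline when nothing is deferred, to preserve ordering.
    if (deferred_.empty() && txn.size() <= inline_limit_) {
      journal_.push_back(txn);
      return 0;
    }
    deferred_.push_back(txn);
    return -EAGAIN;
  }

  // One step of the background writer; false if there was nothing to do.
  bool flush_one() {
    if (deferred_.empty()) return false;
    journal_.push_back(deferred_.front());
    deferred_.pop_front();
    return true;
  }

  const std::vector<std::string> &journal() const { return journal_; }
};
```

Note that a small write queued behind a deferred one must also be
deferred, otherwise it would overtake the earlier transaction in the
journal.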

Anyway, I think we need to make some changes in this area.

> Yehuda
>
> On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
>> Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
>> to the actual interface I presented so much as the notion that the
>> next thing to do is restructure the OpWQ users as async state
>> machines.
>> -Sam
>>
>> On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Tue, 11 Aug 2015, Samuel Just wrote:
>>>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>>>
>>>> 1. Reads are always synchronous, limiting the queue depth seen by the device
>>>>    and therefore the possible parallelism.
>>>> 2. Writes are always asynchronous, forcing even very fast writes to be completed
>>>>    in a separate thread.
>>>> 3. do_op cannot surrender the thread/pg lock during an operation, forcing reads
>>>>    required to continue the operation to be synchronous.
>>>>
>>>> For spinning disks, this is mostly ok since they don't benefit as much from
>>>> large read queues, and writes (filestore with journal) are too slow for the
>>>> thread switches to make a big difference.  For very fast flash, however, we
>>>> want the flexibility to allow the backend to perform writes synchronously or
>>>> asynchronously when it makes sense, and to maintain a larger number of
>>>> outstanding reads than we have threads.  To that end, I suggest changing the
>>>> ObjectStore interface to be somewhat polling based:
>>>>
>>>> /// Create new token
>>>> void *create_operation_token() = 0;
>>>> bool is_operation_complete(void *token) = 0;
>>>> bool is_operation_committed(void *token) = 0;
>>>> bool is_operation_applied(void *token) = 0;
>>>> void wait_for_committed(void *token) = 0;
>>>> void wait_for_applied(void *token) = 0;
>>>> void wait_for_complete(void *token) = 0;
>>>> /// Get result of operation
>>>> int get_result(void *token) = 0;
>>>> /// Must only be called once is_operation_complete(token)
>>>> void reset_operation_token(void *token) = 0;
>>>> /// Must only be called once is_operation_complete(token)
>>>> void destroy_operation_token(void *token) = 0;
>>>>
>>>> /**
>>>>  * Queue a transaction
>>>>  *
>>>>  * token must be either fresh or reset since the last operation.
>>>>  * If the operation is completed synchronously, token can be reused
>>>>  * without calling reset_operation_token.
>>>>  *
>>>>  * @result 0 if completed synchronously, -EAGAIN if async
>>>>  */
>>>> int queue_transaction(
>>>>   Transaction *t,
>>>>   OpSequencer *osr,
>>>>   void *token
>>>>   ) = 0;
>>>>
>>>> /**
>>>>  * Start a read
>>>>  *
>>>>  * token must be either fresh or reset since the last operation.
>>>>  * If the operation is completed synchronously, token can be reused
>>>>  * without calling reset_operation_token.
>>>>  *
>>>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>>>  */
>>>> int read(..., void *token) = 0;
>>>> ...
>>>>
>>>> The "token" concept here is opaque to allow the implementation some
>>>> flexibility.  Ideally, it would be nice to be able to include libaio
>>>> operation contexts directly.
>>>>
>>>> The main goal here is for the backend to have the freedom to complete
>>>> writes and reads asynchronously or synchronously as the situation warrants.
>>>> It also leaves the interface user in control of where the operation
>>>> completion is handled.  Each op thread can therefore handle its own
>>>> completions:
>>>>
>>>> struct InProgressOp {
>>>>   PGRef pg;
>>>>   ObjectStore::Token *token;
>>>>   OpContext *ctx;
>>>> };
>>>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>>>
>>> Probably a deque<> since we'll be pushing new requests and slurping off
>>> completed ones?  Or, we can make token not completely opaque, so that it
>>> includes a boost::intrusive::list node and can be strung on a user-managed
>>> queue.
>>>
>>>> for (auto &op : in_progress) {  // by reference, so the token is stored in the slot
>>>>   op.token = objectstore->create_operation_token();
>>>> }
>>>>
>>>> uint64_t next_to_start = 0;
>>>> uint64_t next_to_complete = 0;
>>>>
>>>> while (1) {
>>>>   if (next_to_start - next_to_complete == MAX_IN_PROGRESS) {  // ring is full
>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>     objectstore->wait_for_complete(op.token);
>>>>   }
>>>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>>>     if (objectstore->is_operation_complete(op.token)) {
>>>>       PGRef pg = op.pg;
>>>>       OpContext *ctx = op.ctx;
>>>>       op.pg = PGRef();
>>>>       op.ctx = nullptr;
>>>>       objectstore->reset_operation_token(op.token);
>>>>       if (pg->continue_op(
>>>>             ctx, &in_progress[next_to_start % MAX_IN_PROGRESS])
>>>>               == -EAGAIN) {
>>>>         ++next_to_start;
>>>>         continue;
>>>>       }
>>>>     } else {
>>>>       break;
>>>>     }
>>>>   }
>>>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>>>   if (dq.second->do_op(
>>>>         dq.first, &in_progress[next_to_start % MAX_IN_PROGRESS])
>>>>           == -EAGAIN) {
>>>>     ++next_to_start;
>>>>   }
>>>> }
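
To convince myself how this loop behaves, I find a small simulation
helpful. The sketch below is not OSD code: Slot, run_op_thread() and
the "latency counted in polls" model are assumptions made purely for
illustration. A latency of 0 plays the role of an operation the store
completed synchronously; anything else stands for -EAGAIN and occupies
a ring slot until it is reaped in FIFO order. It assumes
max_in_progress >= 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for InProgressOp plus its token.
struct Slot {
  int op_id = -1;
  int polls_left = 0;  // simulated latency: complete once this reaches 0
};

// incoming holds (op id, latency) pairs. Returns op ids in completion order.
std::vector<int> run_op_thread(std::deque<std::pair<int, int>> incoming,
                               std::size_t max_in_progress) {
  std::vector<Slot> ring(max_in_progress);
  std::uint64_t next_to_start = 0, next_to_complete = 0;
  std::vector<int> finished;

  while (!incoming.empty() || next_to_complete < next_to_start) {
    // If the ring is full or there is nothing new to start, block on the
    // oldest op (the analogue of wait_for_complete).
    bool full = next_to_start - next_to_complete == max_in_progress;
    if (next_to_complete < next_to_start && (full || incoming.empty())) {
      ring[next_to_complete % max_in_progress].polls_left = 0;
    }
    // Reap completed ops in FIFO order; stop at the first one still pending.
    for (; next_to_complete < next_to_start; ++next_to_complete) {
      Slot &s = ring[next_to_complete % max_in_progress];
      if (s.polls_left > 0) {
        --s.polls_left;  // one poll equals one unit of simulated device time
        break;
      }
      finished.push_back(s.op_id);
    }
    // Start one new op, if any; the reap above guarantees a free slot.
    if (!incoming.empty()) {
      std::pair<int, int> op = incoming.front();
      incoming.pop_front();
      if (op.second == 0) {
        finished.push_back(op.first);  // completed synchronously, no slot used
      } else {
        Slot &s = ring[next_to_start % max_in_progress];
        s.op_id = op.first;
        s.polls_left = op.second;
        ++next_to_start;
      }
    }
  }
  return finished;
}
```

With ops (1, latency 2), (2, latency 0), (3, latency 1) and two slots,
op 2 finishes first because it never leaves the op thread, which is the
behaviour the proposal is after for fast flash.
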
>>>>
>>>> A design like this would allow the op thread to move onto another task if the
>>>> objectstore implementation wants to perform an async operation.  For this
>>>> to work, there is some work to be done:
>>>>
>>>> 1. All current reads in the read and write paths (probably including the attr
>>>>    reads in get_object_context and friends) need to be able to handle getting
>>>>    -EAGAIN from the objectstore.
>>>
>>> Can we leave the old read methods in place as blocking versions, and have
>>> them block on the token before returning?  That'll make the transition
>>> less painful.
>>>
>>>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>>>    during the operation.  This should be ok since the actual object information
>>>>    is protected by the RWState locks.
>>>
>>> All of the async write pieces already handle this (recheck PG state after
>>> taking the lock).  If they don't get -EAGAIN they'd just call the next
>>> stage, probably with a flag indicating that validation can be skipped
>>> (since the lock hasn't been dropped)?
>>>
>>>> 3. OpContext needs to have enough information to pick up where the operation
>>>>    left off.  This suggests that we should obtain all required ObjectContexts
>>>>    at the beginning of the operation.  Cache/Tiering complicates this.
>>>
>>> Yeah...
>>>
>>>> 4. The object class interface will need to be replaced with a new interface
>>>>    based on possibly async reads.  We can maintain compatibility with the
>>>>    current ones by launching a new thread to handle any message which happens
>>>>    to contain an old-style object class operation.
>>>
>>> Again, for now, wrappers would avoid this?
>>>
>>> s
>>>>
>>>> Most of this needs to happen to support object class operations on ec pools
>>>> anyway.
>>>> -Sam
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>

-- 
Best Regards,

Wheat