Re: Async reads, sync writes, op thread model discussion

Samuel Just <sjust@xxxxxxxxxx> · Tue, 11 Aug 2015 15:19:32 -0700



Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
to the actual interface I presented so much as the notion that the
next thing to do is restructure the OpWQ users as async state
machines.
-Sam

On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Tue, 11 Aug 2015, Samuel Just wrote:
>> Currently, there are some deficiencies in how the OSD maps ops onto threads:
>>
>> 1. Reads are always syncronous limiting the queue depth seen from the device
>>    and therefore the possible parallelism.
>> 2. Writes are always asyncronous forcing even very fast writes to be completed
>>    in a seperate thread.
>> 3. do_op cannot surrender the thread/pg lock during an operation forcing reads
>>    required to continue the operation to be syncronous.
>>
>> For spinning disks, this is mostly ok since they don't benefit as much from
>> large read queues, and writes (filestore with journal) are too slow for the
>> thread switches to make a big difference.  For very fast flash, however, we
>> want the flexibility to allow the backend to perform writes syncronously or
>> asyncronously when it makes sense, and to maintain a larger number of
>> outstanding reads than we have threads.  To that end, I suggest changing the
>> ObjectStore interface to be somewhat polling based:
>>
>> /// Create new token
>> void *create_operation_token() = 0;
>> bool is_operation_complete(void *token) = 0;
>> bool is_operation_committed(void *token) = 0;
>> bool is_operation_applied(void *token) = 0;
>> void wait_for_committed(void *token) = 0;
>> void wait_for_applied(void *token) = 0;
>> void wait_for_complete(void *token) = 0;
>> /// Get result of operation
>> int get_result(void *token) = 0;
>> /// Must only be called once is_opearation_complete(token)
>> void reset_operation_token(void *token) = 0;
>> /// Must only be called once is_opearation_complete(token)
>> void detroy_operation_token(void *token) = 0;
>>
>> /**
>>  * Queue a transaction
>>  *
>>  * token must be either fresh or reset since the last operation.
>>  * If the operation is completed syncronously, token can be resused
>>  * without calling reset_operation_token.
>>  *
>>  * @result 0 if completed syncronously, -EAGAIN if async
>>  */
>> int queue_transaction(
>>   Transaction *t,
>>   OpSequencer *osr,
>>   void *token
>>   ) = 0;
>>
>> /**
>>  * Queue a transaction
>>  *
>>  * token must be either fresh or reset since the last operation.
>>  * If the operation is completed syncronously, token can be resused
>>  * without calling reset_operation_token.
>>  *
>>  * @result -EAGAIN if async, 0 or -error otherwise.
>>  */
>> int read(..., void *token) = 0;
>> ...
>>
>> The "token" concept here is opaque to allow the implementation some
>> flexibility.  Ideally, it would be nice to be able to include libaio
>> operation contexts directly.
>>
>> The main goal here is for the backend to have the freedom to complete
>> writes and reads asyncronously or syncronously as the sitation warrants.
>> It also leaves the interface user in control of where the operation
>> completion is handled.  Each op thread can therefore handle its own
>> completions:
>>
>> struct InProgressOp {
>>   PGRef pg;
>>   ObjectStore::Token *token;
>>   OpContext *ctx;
>> };
>> vector<InProgressOp> in_progress(MAX_IN_PROGRESS);
>
> Probably a deque<> since we'll be pushign new requests and slurping off
> completed ones?  Or, we can make token not completely opaque, so that it
> includes a boost::intrusive::list node and can be strung on a user-managed
> queue.
>
>> for (auto op : in_progress) {
>>   op.token = objectstore->create_operation_token();
>> }
>>
>> uint64_t next_to_start = 0;
>> uint64_t next_to_complete = 0;
>>
>> while (1) {
>>   if (next_to_complete - next_to_start == MAX_IN_PROGRESS) {
>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>     objectstore->wait_for_complete(op.token);
>>   }
>>   for (; next_to_complete < next_to_start; ++next_to_complete) {
>>     InProgressOp &op = in_progress[next_to_complete % MAX_IN_PROGRESS];
>>     if (objectstore->is_operation_complete(op.token)) {
>>       PGRef pg = op.pg;
>>       OpContext *ctx = op.ctx;
>>       op.pg = PGRef();
>>       op.ctx = nullptr;
>>       objectstore->reset_operation_token(op.token);
>>       if (pg->continue_op(
>>             ctx, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>               == -EAGAIN) {
>>         ++next_to_start;
>>         continue;
>>       }
>>     } else {
>>       break;
>>     }
>>   }
>>   pair<OpRequestRef, PGRef> dq = // get new request from queue;
>>   if (dq.second->do_op(
>>         dq.first, &in_progress_ops[next_to_start % MAX_IN_PROGRESS])
>>           == -EAGAIN) {
>>     ++next_to_start;
>>   }
>> }
>>
>> A design like this would allow the op thread to move onto another task if the
>> objectstore implementation wants to perform an async operation.  For this
>> to work, there is some work to be done:
>>
>> 1. All current reads in the read and write paths (probably including the attr
>>    reads in get_object_context and friends) need to be able to handle getting
>>    -EAGAIN from the objectstore.
>
> Can we leave the old read methods in place as blocking versions, and have
> them block on the token before returning?  That'll make the transition
> less painful.
>
>> 2. Writes and reads need to be able to handle having the pg lock dropped
>>    during the operation.  This should be ok since the actual object information
>>    is protected by the RWState locks.
>
> All of the async write pieces already handle this (recheck PG state after
> taking the lock).  If they don't get -EAGAIN they'd just call the next
> stage, probably with a flag indicating that validation can be skipped
> (since the lock hasn't been dropped)?
>
>> 3. OpContext needs to have enough information to pick up where the operation
>>    left off.  This suggests that we should obtain all required ObjectContexts
>>    at the beginning of the operation.  Cache/Tiering complicates this.
>
> Yeah...
>
>> 4. The object class interface will need to be replaced with a new interface
>>    based on possibly async reads.  We can maintain compatibility with the
>>    current ones by launching a new thread to handle any message which happens
>>    to contain an old-style object class operation.
>
> Again, for now, wrappers would avoid this?
>
> s
>>
>> Most of this needs to happen to support object class operations on ec pools
>> anyway.
>> -Sam
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html