Re: crimson-osd queues discussion

On Thu, Feb 21, 2019 at 11:57 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 21 Feb 2019, Liu, Chunmei wrote:
> > Hi all,
> >
> >   Here we want to discuss ceph-osd's multiple queues and how we can
> > implement crimson-osd more efficiently, with or without these queues.
> >
> >   We noticed there are multiple places in the current ceph-osd where
> > a request is enqueued when some precondition is not satisfied, such
> > as session->waiting_on_map (waiting for map), slot->waiting (waiting
> > for pg), and the waiting_for_map/peered/active/flush/scrub/etc.
> > queues in pg.h. We need to hold the request in these waiting queues;
> > when a given precondition is satisfied, the enqueued requests are
> > dequeued and pushed to the front of the work queue again, to go
> > through all the precondition checks from the beginning.
> >
> >   1. Is it necessary to go through all the precondition checks
> > again from the beginning, or can we continue from the blocked check?
>
> Look at PG.h line ~1303 for a summary of the various queues.  It's a
> mix: about half of them block and then stop blocking, never to block
> again until a new peering interval.  The others can start/stop
> blocking at any time.
>
> I think this means that we should repeat all of the precondition checks.
>
> >    Crimson-osd is based on the seastar framework and uses
> > future/promise/continuation chains. When a task's precondition is
> > not satisfied, it returns a future immediately; when a promise
> > fulfills that future, the continuation is pushed to the seastar
> > reactor's task queue to be scheduled. In this case we still need to
> > hold a queue for each precondition to keep track of the pending
> > futures, so that when the precondition is satisfied we can call the
> > waiting futures' promises to fulfill them.
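
A per-precondition list of pending futures could be sketched with
seastar::shared_promise; wait_for_active()/on_activate() below are
made-up names, purely for illustration:

  #include <optional>
  #include <seastar/core/shared_future.hh>

  class PG {
    // one shared promise per precondition; every blocked request
    // waits on a future obtained from it, so this is effectively
    // the "queue" of pending futures for that precondition
    std::optional<seastar::shared_promise<>> active_promise;
  public:
    seastar::future<> wait_for_active() {
      if (is_active()) {
        return seastar::make_ready_future<>();
      }
      if (!active_promise) {
        active_promise.emplace();
      }
      return active_promise->get_shared_future();
    }
    // when the PG goes active, wake all waiters; their
    // continuations are then scheduled by the seastar reactor
    void on_activate() {
      if (active_promise) {
        active_promise->set_value();
        active_promise.reset();
      }
    }
    bool is_active() const;
  };
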
> >
> >    2. We have two choices here: a) use the application's own queues
> > to schedule requests, just like the current ceph-osd does
> > (enqueue/dequeue the request from one queue to another when a
> > precondition is not satisfied); in this case the seastar reactor's
> > task scheduler is not involved. b) Use the seastar reactor's task
> > queue; in this case, use the future/promise/continuation model when
> > a precondition is not satisfied and let the seastar reactor do the
> > scheduling (we still need application queues to track the pending
> > futures).
> >      From our crimson-messenger experience, for a simple repeated
> > action such as send-message, an application queue seems more
> > efficient than the seastar reactor's task queue. We are not sure
> > whether that still holds for a complex case like osd/pg.
> >     Which one is better for crimson-osd?
>
> My gut says that using an application queue will make for more robust
> code anyway, and the blocking is relatively rare, so I wouldn't worry
> about the overhead of repeating those checks.  But... I don't have
> any experience or intuition around what makes sense in the
> future/promise style of things.  :/
>
> >    3. For QoS, do we have to use some application queue to
> > implement it? That is, we can't avoid an application queue for QoS?
>
> Yeah, I think we'll need the app queue for this anyway!

A straightforward translation from the existing model would be:
 - each connection to a rados client pushes the decoded requests to a
queue with QoS support. queue.push_back() blocks if the QoS policy
asks the client to back off or wait (see also
seastar::queue::push_eventually()), and the fiber will likewise block
when it tries to read more messages from the client.
 - each objectstore or pg backend runs a loop that keeps grabbing
requests from its queue, processes each one, and replies to the
client with the result, until the osd instance is asked to stop or
the PG is deleted. This fiber blocks when there are no pending
requests in the queue; a minimal sketch of that consumer loop is
below.
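
The sketch assumes a seastar::queue<request> member named "pending";
do_request() and send_response() are hypothetical helpers:

  #include <seastar/core/future-util.hh>
  #include <seastar/core/queue.hh>

  // one consumer fiber per pg backend; "pending" is fed by the
  // client connections described above
  seastar::future<> PGBackend::run() {
    return seastar::do_until(
      [this] { return _stopping; },
      [this] {
        // pop_eventually() returns a future that resolves once a
        // request shows up, so this fiber just yields while the
        // queue is empty
        return pending.pop_eventually().then([this](auto req) {
          return do_request(req).then([req](auto resp) {
            // reply to the client with the result
            return req.conn->send_response(std::move(resp));
          });
        });
      });
  }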

Yeah, probably we cannot avoid a queue/bucket when implementing proper
QoS, but I think we can have a futurized queue, so that we avoid the
enqueue/dequeue when the underlying "worker thread" is fast enough to
consume the request *immediately*. Put another way: is it possible to
avoid the fiber context switch if we know that the queue is ready to
pop() the current request? Namely, we can have a slightly different
implementation.

So the producer side would look like:

  seastar::do_until(
    [this] { return _stopping; },
    [this] {
      return conn.read_request().then([this](auto req) {
        // make_queueable() is a generic function which extracts the
        // weight/cost/priority from the given request
        return queue.push_and_pop(make_queueable(req)).then([this, req] {
          // if this req is lucky enough, it won't need to wait even
          // a jiffy before being served
          return do_request(req);
        }).then([this](auto resp) {
          return conn.send_response(std::move(resp));
        });
      });
    });

And the above would be the client connection's start() method, if we
choose to have an optimistic queue, where in the best case a request
is handled immediately without being blocked. The downside is that
every pending client has to wrap its pending request in a seastar
task and wait on its promise, and the queue needs to keep track of
all the pending futures of the pending client requests. A rough
sketch of such a queue follows.
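
This is illustration only; the Policy type and its
can_serve()/account() hooks are made-up names, not an existing
interface:

  #include <deque>
  #include <utility>
  #include <seastar/core/future.hh>

  template <typename Item, typename Policy>
  class futurized_queue {
    Policy _policy;
    // requests which could not be served immediately, each paired
    // with the promise its submitter is waiting on
    std::deque<std::pair<Item, seastar::promise<>>> _waiters;
  public:
    // resolves immediately when the request can be served right
    // away: the fast path skips the enqueue/dequeue and the extra
    // fiber context switch altogether
    seastar::future<> push_and_pop(Item item) {
      if (_waiters.empty() && _policy.can_serve(item)) {
        _policy.account(item);
        return seastar::make_ready_future<>();
      }
      // otherwise park the request and hand back a future to wait on
      _waiters.emplace_back(std::move(item), seastar::promise<>());
      return _waiters.back().second.get_future();
    }
    // called whenever the policy frees up capacity, e.g. when an
    // in-flight request completes
    void wake() {
      while (!_waiters.empty() &&
             _policy.can_serve(_waiters.front().first)) {
        _policy.account(_waiters.front().first);
        _waiters.front().second.set_value();
        _waiters.pop_front();
      }
    }
  };

In the contended case this degenerates to exactly the queue of
pending futures described above.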

-- 
Regards
Kefu Chai


