Re: crimson-osd queues discussion

On Wed, Mar 6, 2019 at 3:49 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> I realize I'm a bit late here, but I had some thoughts I wanted to get
> out as well...
>
> On Wed, Feb 20, 2019 at 7:54 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >
> > On Thu, 21 Feb 2019, Liu, Chunmei wrote:
> > > Hi all,
> > >
> > >   Here we want to discuss the multiple queues in ceph-osd and how we
> > > can implement crimson-osd more efficiently with or without them.
> > >
> > >   We noticed that the current ceph-osd enqueues a request in multiple
> > > places when some precondition is not satisfied, e.g.
> > > session->waiting_on_map (waiting for the map), slot->waiting (waiting
> > > for the pg), and the waiting_for_map/peered/active/flush/scrub/...
> > > queues in pg.h.  The request is held in these waiting queues; when the
> > > relevant precondition is satisfied, it is dequeued and requeued at the
> > > front of the work queue to go through all the precondition checks from
> > > the beginning again.
> > >
> > >    1. Is it necessary to go through all the precondition checks again
> > > from the beginning, or can we continue from the check that blocked?
> >
> > Look at PG.h line ~1303 or so for a summary of the various queues.  It's a
> > mix: about half of them block and then stop blocking, never to block
> > again, until a new peering interval.  The others can start/stop blocking
> > at any time.
> >
> > I think this means that we should repeat all of the precondition checks.
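a minimal sketch of what repeating all of the checks could look like in
the futures world (all names hypothetical, not crimson code): each
checker either passes or hands back a future to wait on, and after any
wait we start over from the first check, since a different precondition
may have started blocking in the meantime.

#include <seastar/core/future.hh>
#include <seastar/core/future-util.hh>
#include <functional>
#include <optional>
#include <vector>

struct op_t {};  // stand-in for an OSD request

// a checker returns std::nullopt if its precondition already holds,
// or a future that resolves once it does
using checker_t = std::function<std::optional<seastar::future<>>(op_t&)>;

// note: the caller must keep `op` and `checks` alive until the
// returned future resolves
seastar::future<> run_when_unblocked(op_t& op,
                                     std::vector<checker_t>& checks) {
  return seastar::repeat([&op, &checks] {
    for (auto& check : checks) {
      if (auto blocked = check(op)) {
        // blocked: wait, then restart the whole sequence of checks
        return blocked->then([] { return seastar::stop_iteration::no; });
      }
    }
    // every precondition held in one pass; proceed with the op
    return seastar::make_ready_future<seastar::stop_iteration>(
        seastar::stop_iteration::yes);
  });
}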
> >
> > >    Crimson-osd is based on the seastar framework and uses
> > > future/promise/continuation chains: when a task's precondition is not
> > > currently satisfied, the task returns a future immediately, and once a
> > > promise fulfills that future, the continuation is pushed onto the
> > > seastar reactor's task queue to be scheduled.  In this model we still
> > > need to hold a queue per precondition to keep track of the pending
> > > futures, so that when the precondition is satisfied we can fulfill the
> > > waiting futures through their promises.
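a minimal sketch of such a "queue per precondition to keep track of
pending futures" (hypothetical, not crimson code): blocked ops park a
promise on a per-precondition list, and fulfilling the precondition
drains the list, handing every continuation back to the reactor to
schedule.

#include <seastar/core/future.hh>
#include <vector>

class wait_list {
  std::vector<seastar::promise<>> waiters;
public:
  // called by an op whose precondition does not hold yet
  seastar::future<> wait() {
    waiters.emplace_back();
    return waiters.back().get_future();
  }
  // called by whatever event satisfies the precondition
  void wake_all() {
    auto woken = std::move(waiters);
    waiters.clear();
    for (auto& p : woken) {
      p.set_value();  // pushes the continuation onto the reactor queue
    }
  }
};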
> > >
> > >    2. We have two choices here: a) use the application's own queues to
> > > schedule requests, just like the current ceph-osd does (moving a
> > > request from one queue to another when a precondition is not
> > > satisfied); in this case the seastar reactor's task scheduler is not
> > > involved.  b) Use the seastar reactor's task queue; in this case,
> > > follow the future/promise/continuation model when a precondition is
> > > not satisfied and let the seastar reactor do the scheduling (this
> > > still needs application queues for tracking pending futures).
> > >      From our crimson-messenger experience, for a simple repeated
> > > action such as send-message, an application queue seems more effective
> > > than the seastar reactor's task queue.  We are not sure whether that
> > > still holds for a case as complex as osd/pg.
> > >     Which one is better for crimson-osd?
> >
> > My gut says using an application queue will make for more robust code
> > anyway, and blocking is relatively rare, so I wouldn't worry about the
> > overhead of repeating those checks.  But... I don't have
> > any experience or intuition around what makes sense in the future/promise
> > style of things.  :/
>
> I'm actually on the other side of this fence. The queues are fairly
> stable now, but getting them to that point took a long time and
> maintaining them correctly is still one of the most finicky parts of
> making real changes in the OSD code. They are a big piece of "global"
> state that don't show up in many places but are absolutely critical to
> maintain correctly, so it's hard for developers to learn the rules
> about them AND easy to miss that they need to be considered when
> making otherwise-unrelated changes.
> I was very much hoping that we could turn all of that explicit mapping
> into implicit dependency chains that validate preconditions (or pause
> until they are satisfied) using futures that can be handled by the
> reactor and otherwise only need to be considered at the point where
> they are asserted, rather than later on at random places in the code.
> I *think* this is feasible?

yes, i am also in favor of this. see
https://github.com/ceph/ceph/pull/26697/commits/81e906d82d9e04ebe5b8b230d424b300ebff2f93
and https://github.com/ceph/ceph/pull/26697/commits/d64c7c8022cacfc787231cfa61d9ea0fdcc58013#diff-1449683df2509676ff6b4977eff7e74bR660
for examples. chaining the producer and consumer in the same place helps
with readability, and probably helps with performance as well.
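not the code in those commits, just a toy sketch of the shape being
argued for (all names made up): the consumer awaits the precondition
inline, right where it matters, and the producer fulfills it in one
obvious place, so the whole path of a request reads top to bottom.

#include <seastar/core/future.hh>
#include <seastar/core/shared_future.hh>

struct pg_t {
  // the producer side fulfills this once, e.g. when peering completes;
  // every waiter's continuation is then rescheduled by the reactor
  seastar::shared_promise<> active;

  seastar::future<> wait_until_active() {
    return active.get_shared_future();
  }
  seastar::future<> do_op(int op_id) {
    // ... the actual request processing would go here ...
    return seastar::make_ready_future<>();
  }
};

// consumer: no named waiting list; the dependency is visible in the chain
seastar::future<> handle_op(pg_t& pg, int op_id) {
  return pg.wait_until_active().then([&pg, op_id] {
    return pg.do_op(op_id);
  });
}

// producer: peering finishes somewhere else and simply wakes everyone
void on_activate(pg_t& pg) {
  pg.active.set_value();
}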

>
> There is a bit of a challenge to this when debugging blocked ops, but
> I presume we'll need to develop a robust way of checking reactor
> dependency chains anyway so I don't think it should be any worse than
> if we had to build up debugging around all the queues.

ahh, this is a good point. i never thought about a way to check the
dependency chain. this would need a probe touching the innards of the
reactor.
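
one conceivable shape for such a probe (purely hypothetical, not an
existing seastar or crimson facility): wrap each precondition wait in a
helper that records a human-readable label while the future is pending,
so dumping the registry shows what every stuck op is blocked on, without
reaching into the reactor itself.

#include <seastar/core/future.hh>
#include <cstdint>
#include <map>
#include <string>

// op id -> what it is currently blocked on (one reactor thread, no locks)
std::map<uint64_t, std::string> blocked_ops;

template <typename Fut>
Fut track_blocker(uint64_t op_id, std::string label, Fut fut) {
  if (fut.available()) {
    return fut;                  // never actually blocked, no bookkeeping
  }
  blocked_ops[op_id] = std::move(label);
  return fut.then_wrapped([op_id](Fut f) {
    blocked_ops.erase(op_id);    // resolved or failed, no longer blocked
    return f;
  });
}

// usage: track_blocker(op_id, "waiting_for_active", pg.wait_until_active())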

>
> >
> > >    3. For QoS, do we have to use an application queue to implement
> > > it?  That is, can we not avoid application queues for QoS?
> >
> > Yeah, I think we'll need the app queue for this anyway!
>
> It would depend on exactly what functionality we need, but I don't
> think this is accurate. We can chain futures such that we wait for the
> previous client op to complete, then wait on a timer, if we are just
> limiting it to an absolute IOP rate. dmclock is a little harder but we
> can also do futures that are satisfied by the completion of a single
> "child" future, which would let us combine many different conditions
> together and probably build that model.
> -Greg
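
as a toy sketch of the first idea, an absolute IOPS cap built from
nothing but chained futures and a timer (all names hypothetical; dmclock
would need the richer composition of "child" futures described above):

#include <seastar/core/future.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>

class iops_limiter {
  std::chrono::microseconds interval;   // pacing gap between admissions
  seastar::shared_future<> tail;        // completion of the previous op
public:
  explicit iops_limiter(unsigned iops)
    : interval(1000000 / iops),
      tail(seastar::make_ready_future<>()) {}

  // wait for the previous op to complete, then for the pacing timer,
  // then run this op; error handling is elided, so a failed op would
  // also fail everything queued behind it in this toy version
  template <typename Fn>
  seastar::future<> submit(Fn fn) {
    seastar::shared_future<> done(
      tail.get_future()
        .then([this] { return seastar::sleep(interval); })
        .then(std::move(fn)));
    tail = done;                 // the next op chains behind this one
    return done.get_future();    // the caller can observe completion too
  }
};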



-- 
Regards
Kefu Chai


