Re: crimson-osd queues discussion

On Fri, Mar 8, 2019 at 6:23 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
>
> On Wed, Mar 6, 2019 at 3:49 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > I realize I'm a bit late here, but I had some thoughts I wanted to get
> > out as well...
> >
> > On Wed, Feb 20, 2019 at 7:54 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >
> > > On Thu, 21 Feb 2019, Liu, Chunmei wrote:
> > > > Hi all,
> > > >
> > > >   Here we want to discuss ceph-osd's multiple queues, and how we can
> > > > implement crimson-osd more efficiently with or without these queues.
> > > >
> > > >   We noticed that the current ceph-osd enqueues a request in multiple
> > > > places when some precondition is not satisfied, e.g.
> > > > session->waiting_on_map (waiting for map), slot->waiting (waiting for
> > > > pg), and waiting_for_map/peered/active/flush/scrub/** etc. in pg.h.
> > > > We have to hold the request in these waiting queues; when a given
> > > > precondition is satisfied, the enqueued requests are dequeued and
> > > > pushed to the front of the work queue again, to go through all the
> > > > precondition checks from the beginning.
> > > >
> > > >   1. Is it necessary to go through all the precondition checks again
> > > > from the beginning, or can we continue from the blocked check?
> > >
> > > Look at PG.h line ~1303 or so for a summary of the various queues.  It's a
> > > mix: about half of them block and then stop blocking, never to block
> > > again, until a new peering interval.  The others can start/stop blocking
> > > at any time.
> > >
> > > I think this means that we should repeat all of the precondition checks.
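To make the "repeat all of the checks" point concrete, here is a minimal stand-alone sketch in plain C++ (all names invented for illustration; no actual OSD types): a blocked request is parked, and on wakeup it re-runs the whole precondition list from the start, since a different condition may have begun blocking in the meantime.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <vector>

struct Request { int id; };

// Hypothetical stand-in for a PG's precondition gating. Each precondition
// is a predicate; a request blocked on any check is parked, and wake()
// re-evaluates every check from the beginning.
struct PgSketch {
  std::vector<std::function<bool()>> preconditions;  // e.g. have-map, peered, active...
  std::deque<Request> waiting;                       // parked requests

  // Returns true if the request passed every check (i.e. may be processed).
  bool try_process(Request r) {
    for (auto& ok : preconditions) {
      if (!ok()) {             // blocked: park it and bail out
        waiting.push_back(r);
        return false;
      }
    }
    return true;               // all preconditions held
  }

  // Called when some condition flips to satisfied: take the parked
  // requests and re-check everything from scratch; still-blocked ones
  // are re-parked by try_process().
  std::vector<int> wake() {
    std::vector<int> processed;
    std::deque<Request> parked;
    parked.swap(waiting);
    for (auto& r : parked)
      if (try_process(r))
        processed.push_back(r.id);
    return processed;
  }
};
```

The key property Sage describes is that wake() does not resume from the check that blocked; it simply re-runs try_process(), because half of the classic queues can start blocking again at any time.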
> > >
> > > >    Crimson-osd is based on the seastar framework and uses
> > > > future/promise/continuation chains: when a task's precondition is
> > > > not satisfied, the task returns a future immediately, and when a
> > > > promise fulfills that future, the continuation is pushed to the
> > > > seastar reactor's task queue for scheduling.  In this case we still
> > > > need to hold a queue per precondition to keep track of the pending
> > > > futures, so that when the precondition is satisfied we can fulfill
> > > > the waiting promises.
> > > >
> > > >    2. We have two choices here: a) use the application's own queues
> > > > to schedule requests, just like the current ceph-osd does
> > > > (enqueueing/dequeueing a request from one queue to another when a
> > > > precondition is not satisfied); in this case the seastar reactor's
> > > > task scheduler is not involved.  b) Use the seastar reactor task
> > > > queue, i.e. the future/promise/continuation model: when a
> > > > precondition is not satisfied, let the seastar reactor do the
> > > > scheduling (application queues are still needed for tracking
> > > > pending futures).
> > > >      From our crimson-messenger experience, for simple repeated
> > > > actions such as send-message, an application queue seems more
> > > > efficient than the seastar reactor task queue.  We are not sure
> > > > whether that still holds for a case as complex as osd/pg.
> > > >     Which one is better for crimson-osd?
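The bookkeeping that option (b) still needs can be sketched with plain std::promise/std::future (seastar's futures are a different, non-thread-safe type, but the shape is the same): each precondition owns its list of pending promises, and whichever code makes the condition true fulfills them all in one place.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <vector>

// Illustrative only: a precondition that waiters can obtain a future
// from. The "application queue for tracking pending futures" is the
// waiters_ vector; the producer side resolves every pending future at
// the moment the condition becomes true.
class Precondition {
  bool satisfied_ = false;
  std::vector<std::promise<void>> waiters_;
public:
  // Consumer side: get a future to chain a continuation on.
  std::future<void> wait() {
    std::promise<void> p;
    auto f = p.get_future();
    if (satisfied_)
      p.set_value();                      // fast path: already satisfied
    else
      waiters_.push_back(std::move(p));   // track the pending future
    return f;
  }
  // Producer side: mark satisfied and wake all waiters.
  void fulfill() {
    satisfied_ = true;
    for (auto& p : waiters_)
      p.set_value();
    waiters_.clear();
  }
};
```

Note that even under option (b) this per-precondition waiter list exists; the difference is that the continuation logic lives with the future chain rather than in an explicit work-queue requeue path.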
> > >
> > > My gut says that using an application queue will make for more robust
> > > code anyway, and the blocking is relatively rare, so I wouldn't worry
> > > about the overhead of repeating those checks.  But... I don't have
> > > any experience or intuition around what makes sense in the future/promise
> > > style of things.  :/
> >
> > I'm actually on the other side of this fence. The queues are fairly
> > stable now, but getting them to that point took a long time and
> > maintaining them correctly is still one of the most finicky parts of
> > making real changes in the OSD code. They are a big piece of "global"
> > state that don't show up in many places but are absolutely critical to
> > maintain correctly, so it's hard for developers to learn the rules
> > about them AND easy to miss that they need to be considered when
> > making otherwise-unrelated changes.
> > I was very much hoping that we could turn all of that explicit mapping
> > into implicit dependency chains that validate preconditions (or pause
> > until they are satisfied) using futures that can be handled by the
> > reactor and otherwise only need to be considered at the point where
> > they are asserted, rather than later on at random places in the code.
> > I *think* this is feasible?
>
> yes, i am also in favor of this. see
> https://github.com/ceph/ceph/pull/26697/commits/81e906d82d9e04ebe5b8b230d424b300ebff2f93
> and https://github.com/ceph/ceph/pull/26697/commits/d64c7c8022cacfc787231cfa61d9ea0fdcc58013#diff-1449683df2509676ff6b4977eff7e74bR660
> for examples. chaining the producer and the consumer in the same place
> helps with readability and probably with performance as well.

These links have gone stale in a rebase; can you share the commit titles?
Browsing some of them, I see that "crimson/osd: wait osdmap before
processing peering evt" has added an explicit "waiting_peering"
application queue that I presume we have to maintain just like all the
others that exist in the classic code.
-Greg

>
> >
> > There is a bit of a challenge to this when debugging blocked ops, but
> > I presume we'll need to develop a robust way of checking reactor
> > dependency chains anyway so I don't think it should be any worse than
> > if we had to build up debugging around all the queues.
>
> ahh, this is a good point. i never thought about a way to check the
> dependency chain. this would need a probe touching the innards of the
> reactor.
>
> >
> > >
> > > >    3. For QoS, do we have to use some application queue to implement
> > > > it?  That is, is an application queue unavoidable for QoS?
> > >
> > > Yeah, I think we'll need the app queue for this anyway!
> >
> > It would depend on exactly what functionality we need, but I don't
> > think this is accurate. If we are just limiting to an absolute IOPS
> > rate, we can chain futures such that we wait for the previous client
> > op to complete, then wait on a timer. dmclock is a little harder, but
> > we can also create futures that are satisfied by the completion of a
> > single "child" future, which would let us combine many different
> > conditions together and probably build that model.
> > -Greg
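The "wait for the previous op, then wait on a timer" idea reduces to a small piece of admission arithmetic. A hedged sketch (invented RateLimiter type, virtual clock standing in for seastar's timer/sleep, no futures) of how chaining each op behind the previous one caps an absolute rate:

```cpp
#include <cassert>

// Illustrative rate cap: each op may not start before the previous op's
// admission point plus 1/rate. Chaining op N+1's future behind op N's
// completion plus a timer gives exactly this behavior, with no
// application-level queue.
struct RateLimiter {
  double interval;        // seconds between ops (1.0 / target ops-per-second)
  double next_start = 0;  // earliest admission time for the next op

  // Given the current (virtual) time, return when this op may run, and
  // push the admission point for the op after it forward by interval.
  double admit(double now) {
    double start = now < next_start ? next_start : now;
    next_start = start + interval;
    return start;
  }
};
```

With interval = 1.0, three ops arriving at t=0 are admitted at t=0, 1, 2; an op arriving after the limiter has gone idle is admitted immediately.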
>
>
>
> --
> Regards
> Kefu Chai


