Re: crimson-osd queues discussion

On Wed, Mar 13, 2019 at 2:31 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Fri, Mar 8, 2019 at 6:23 PM kefu chai <tchaikov@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 6, 2019 at 3:49 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > >
> > > I realize I'm a bit late here, but I had some thoughts I wanted to get
> > > out as well...
> > >
> > > On Wed, Feb 20, 2019 at 7:54 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, 21 Feb 2019, Liu, Chunmei wrote:
> > > > > Hi all,
> > > > >
> > > > >   Here we want to discuss the multiple queues in ceph-osd and how we
> > > > > can implement crimson-osd more efficiently with or without these queues.
> > > > >
> > > > >   We noticed there are multiple places in the current ceph-osd where a
> > > > > request is enqueued when some precondition is not satisfied, such as
> > > > > session->waiting_on_map (waiting for map), slot->waiting (waiting for pg),
> > > > > and waiting_for_map/peered/active/flush/scrub/** etc. in pg.h. We need to
> > > > > hold the request in these waiting queues; when the precondition is
> > > > > satisfied, the enqueued requests are dequeued and pushed to the front of
> > > > > the work queue again, to go through all the precondition checks from the
> > > > > beginning.
> > > > >
> > > > >   1. Is it necessary to go through all the precondition checks again
> > > > > from the beginning, or can we continue from the check that blocked?
> > > >
> > > > Look at PG.h line ~1303 or so for a summary of the various queues.  It's a
> > > > mix: about half of them block and then stop blocking, never to block
> > > > again, until a new peering interval.  The others can start/stop blocking
> > > > at any time.
> > > >
> > > > I think this means that we should repeat all of the precondition checks.
> > > >
> > > > >    Crimson-osd is based on the seastar framework and uses
> > > > > future/promise/continuation chains. When a task's precondition is not
> > > > > currently satisfied, it returns a future immediately; when the promise
> > > > > fulfills the future, the continuation task is pushed to the seastar
> > > > > reactor's task queue to be scheduled.  In this case we still need to hold
> > > > > a queue for each precondition to keep track of pending futures, so that
> > > > > when the precondition is satisfied we can fulfill the waiting futures via
> > > > > their promises.
> > > > >
> > > > >    2. We have two choices here: a) use the application's own queues to
> > > > > schedule requests, just like the current ceph-osd does (enqueue/dequeue a
> > > > > request from one queue to another when a precondition is not satisfied);
> > > > > in this case the seastar reactor task scheduler is not involved. b) Use
> > > > > the seastar reactor task queue; in this case use the
> > > > > future/promise/continuation model when a precondition is not satisfied
> > > > > and let the seastar reactor do the scheduling (application queues are
> > > > > still needed to track pending futures).
> > > > >      From our crimson-messenger experience, for a simple repeated action
> > > > > such as send-message, an application queue seems more efficient than the
> > > > > seastar reactor task queue.  We are not sure whether that is still true
> > > > > for a complex case like osd/pg.
> > > > >     Which one is better for crimson-osd?
> > > >
> > > > My gut says that using an application queue will make for more robust
> > > > code anyway, and the blocking is relatively rare, so I wouldn't worry
> > > > about the overhead of repeating those checks.  But... I don't have any
> > > > experience or intuition around what makes sense in the future/promise
> > > > style of things.  :/
> > >
> > > I'm actually on the other side of this fence. The queues are fairly
> > > stable now, but getting them to that point took a long time and
> > > maintaining them correctly is still one of the most finicky parts of
> > > making real changes in the OSD code. They are a big piece of "global"
> > > state that don't show up in many places but are absolutely critical to
> > > maintain correctly, so it's hard for developers to learn the rules
> > > about them AND easy to miss that they need to be considered when
> > > making otherwise-unrelated changes.
> > > I was very much hoping that we could turn all of that explicit mapping
> > > into implicit dependency chains that validate preconditions (or pause
> > > until they are satisfied) using futures that can be handled by the
> > > reactor and otherwise only need to be considered at the point where
> > > they are asserted, rather than later on at random places in the code.
> > > I *think* this is feasible?
> >
> > yes, i am also in favor of this. see
> > https://github.com/ceph/ceph/pull/26697/commits/81e906d82d9e04ebe5b8b230d424b300ebff2f93
> > and https://github.com/ceph/ceph/pull/26697/commits/d64c7c8022cacfc787231cfa61d9ea0fdcc58013#diff-1449683df2509676ff6b4977eff7e74bR660
> > for examples. Chaining the producer and consumer in the same place
> > helps with readability and probably with performance.
>
> These links have gone stale in a rebase; can you share the commit titles?

sure, they are
- "crimson/osd: wait osdmap before processing peering evt"
- "crimson/osd/pg: wait until pg is active"
in https://github.com/ceph/ceph/pull/26697

> Browsing some of them, I see that "crimson/osd: wait osdmap before
> processing peering evt" has added an explicit "waiting_peering"
> application queue that I presume we have to maintain just like all the
> others that exist in the classic code.

To be specific, it is a map of shared_promise. Ideally, any
request/event expecting an osdmap that is not yet available should wait
on a future returned by the promise stored in this map. Once a
new-enough map is consumed, all waiters for maps older than that map
are awoken and continue with whatever they were doing. Currently,
only peering messages use this facility. Once there are more
consumers, we should probably rename it to osdmap_promises.
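
Roughly, the pattern looks like the minimal sketch below. It is only a
hypothetical illustration built on seastar's shared_promise; the names
OSDMapGate, wait_for_map() and got_map() are placeholders, not the
identifiers used in the PR.

#include <map>
#include <seastar/core/future.hh>
#include <seastar/core/shared_future.hh>

using epoch_t = unsigned;

class OSDMapGate {
  // one shared_promise per awaited epoch; all waiters for the same
  // epoch share the future obtained from it
  std::map<epoch_t, seastar::shared_promise<>> waiting_peering;
  epoch_t current_epoch = 0;

public:
  // a request/event that needs osdmap `epoch` waits here if we do not
  // have a new-enough map yet
  seastar::future<> wait_for_map(epoch_t epoch) {
    if (epoch <= current_epoch) {
      return seastar::make_ready_future<>();
    }
    return waiting_peering[epoch].get_shared_future();
  }

  // called once a new-enough map has been consumed: wake every waiter
  // that was blocked on an epoch <= the one we just got
  void got_map(epoch_t epoch) {
    current_epoch = epoch;
    auto last = waiting_peering.upper_bound(epoch);
    for (auto i = waiting_peering.begin(); i != last; ++i) {
      i->second.set_value();
    }
    waiting_peering.erase(waiting_peering.begin(), last);
  }
};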

> -Greg



>
> >
> > >
> > > There is a bit of a challenge to this when debugging blocked ops, but
> > > I presume we'll need to develop a robust way of checking reactor
> > > dependency chains anyway so I don't think it should be any worse than
> > > if we had to build up debugging around all the queues.
> >
> > ahh, this is a good point. i never thought about a way to check the
> > dependency chain. this would need a probe touching the innards of the
> > reactor.
> >
> > >
> > > >
> > > > >    3. For QoS, do we have to use some application queue to implement
> > > > > QoS? In other words, does that mean we can't avoid an application queue
> > > > > for QoS?
> > > >
> > > > Yeah, I think we'll need the app queue for this anyway!
> > >
> > > It would depend on exactly what functionality we need, but I don't
> > > think this is accurate. We can chain futures such that we wait for the
> > > previous client op to complete, then wait on a timer, if we are just
> > > limiting it to an absolute IOP rate. dmclock is a little harder but we
> > > can also do futures that are satisfied by the completion of a single
> > > "child" future, which would let us combine many different conditions
> > > together and probably build that model.
> > > -Greg
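
Just to illustrate the absolute-rate case: chaining each op behind the
previous op's completion plus a timer could look roughly like the
hypothetical sketch below (OpRateLimiter and submit() are made-up
names, not code from the PR; seastar::sleep supplies the timer).

#include <chrono>
#include <utility>
#include <seastar/core/future.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/sleep.hh>

class OpRateLimiter {
  std::chrono::microseconds interval;   // e.g. 1s / target IOPS
  seastar::shared_future<> prev{seastar::make_ready_future<>()};
public:
  explicit OpRateLimiter(std::chrono::microseconds i) : interval(i) {}

  // run `op` only after the previous op has completed and the timer
  // has expired, so ops are spaced at least `interval` apart
  template <typename Func>
  seastar::future<> submit(Func op) {
    auto f = prev.get_future()
      .then([this] { return seastar::sleep(interval); })
      .then([op = std::move(op)]() mutable { return op(); });
    prev = seastar::shared_future<>(std::move(f)); // next op chains behind this one
    return prev.get_future();                      // caller waits on the same chain
  }
};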
> >
> >
> >
> > --
> > Regards
> > Kefu Chai



-- 
Regards
Kefu Chai


