Re: crimson-osd queues discussion

Gregory Farnum <gfarnum@xxxxxxxxxx> · Tue, 5 Mar 2019 11:37:52 -0800

I realize I'm a bit late here, but I had some thoughts I wanted to get
out as well...

On Wed, Feb 20, 2019 at 7:54 PM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Thu, 21 Feb 2019, Liu, Chunmei wrote:
> > Hi all,
> >
> >   Here we want to discuss the multiple queues in ceph-osd and how we
> > can implement crimson-osd more efficiently with or without them.
> >
> >   We noticed that the current ceph-osd enqueues a request in several
> > places when some precondition is not satisfied, such as
> > session->waiting_on_map (waiting for map), slot->waiting (waiting for pg),
> > waiting_for/map/peered/active/flush/scrub/** etc. in pg.h.  The request
> > is held in one of these waiting queues; when the precondition is
> > satisfied, the request is dequeued and enqueued at the front of the
> > work queue again to go through all the precondition checks from the
> > beginning.
> >
> >   1. Is it necessary to go through all the precondition checks again
> > from the beginning, or can we continue from the blocked check?
>
> Look at PG.h line ~1303 or so for a summary of the various queues.  It's a
> mix: about half of them block and then stop blocking, never to block
> again, until a new peering interval.  The others can start/stop blocking
> at any time.
>
> I think this means that we should repeat all of the precondition checks.
>
> >    Crimson-osd is based on the seastar framework and uses
> > future/promise/continuation chains.  When a task's precondition is not
> > satisfied, it returns a future immediately; when the promise fulfills
> > the future, the continuation is pushed to the task queue of the
> > seastar reactor to be scheduled.  In this case we still need to hold a
> > queue for each precondition to keep track of pending futures, so that
> > when a precondition is satisfied we can fulfill the promises of the
> > waiting futures.
> >
> >    2. We have two choices here: a) use the application's own queues to
> > schedule requests, just like the current ceph-osd (move a request from
> > one queue to another when a precondition is not satisfied); in this
> > case the seastar reactor task scheduler is not involved.  b) Use the
> > seastar reactor task queue; in this case use the
> > future/promise/continuation model when a precondition is not satisfied
> > and let the seastar reactor do the scheduling (application queues are
> > still needed for tracking pending futures).
> >      From our crimson-messenger experience, for simple repeated
> > actions such as send-message, an application queue seems more
> > efficient than the seastar reactor task queue.  We are not sure whether
> > that still holds for a complex case like osd/pg.
> >     Which one is better for crimson-osd?
>
> My gut says this will make for more robust code anyway to use an
> application queue, and the blocking is relatively rare, so I wouldn't
> worry about the overhead of repeating those checks.  But... I don't have
> any experience or intuition around what makes sense in the future/promise
> style of things.  :/

I'm actually on the other side of this fence. The queues are fairly
stable now, but getting them to that point took a long time and
maintaining them correctly is still one of the most finicky parts of
making real changes in the OSD code. They are a big piece of "global"
state that doesn't show up in many places but is absolutely critical to
maintain correctly, so it's hard for developers to learn the rules
about them AND easy to miss that they need to be considered when
making otherwise-unrelated changes.
I was very much hoping that we could turn all of that explicit mapping
into implicit dependency chains that validate preconditions (or pause
until they are satisfied) using futures that can be handled by the
reactor and otherwise only need to be considered at the point where
they are asserted, rather than later on at random places in the code.
I *think* this is feasible?
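
To make the idea concrete, here is a minimal sketch of preconditions
expressed as awaitable gates instead of explicit waiting queues. It uses
Python's asyncio purely as a stand-in for Seastar futures and the
reactor; Precondition, run_op and demo are hypothetical names, not
anything in ceph-osd or crimson-osd. The op re-checks every gate after
each wake-up, since some conditions can start blocking again at any
time.

```python
import asyncio


class Precondition:
    """Hypothetical gate an op can wait on (stand-in for one waiting_* queue)."""

    def __init__(self, name):
        self.name = name
        self._event = asyncio.Event()  # starts unsatisfied

    def is_satisfied(self):
        return self._event.is_set()

    def satisfy(self):
        # Wakes every op currently waiting on this gate.
        self._event.set()

    def block(self):
        # A gate may close again, e.g. on a new peering interval.
        self._event.clear()

    async def wait(self):
        await self._event.wait()


async def run_op(op_name, preconditions, log):
    # Re-check all gates from the beginning after each wake-up: a gate
    # that was open earlier may have closed while we waited on another.
    while True:
        blocked = next((p for p in preconditions if not p.is_satisfied()), None)
        if blocked is None:
            break
        log.append(f"{op_name} waits on {blocked.name}")
        await blocked.wait()
    log.append(f"{op_name} executes")
    return op_name


async def demo():
    log = []
    have_map = Precondition("map")
    active = Precondition("active")
    op = asyncio.create_task(run_op("op1", [have_map, active], log))
    await asyncio.sleep(0)   # let the op run until it blocks on "map"
    have_map.satisfy()
    await asyncio.sleep(0)   # op wakes up, then blocks on "active"
    active.satisfy()
    await op
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The only place a precondition is mentioned is the list handed to
run_op; nothing else has to remember to requeue the op. The log also
records which gate each op is parked on, which is roughly the
information a blocked-op dump would need.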

There is a bit of a challenge to this when debugging blocked ops, but
I presume we'll need to develop a robust way of checking reactor
dependency chains anyway so I don't think it should be any worse than
if we had to build up debugging around all the queues.

>
> >    3. For QoS, do we have to use some application queue to implement
> > it?  That is, can't we avoid an application queue for QoS?
>
> Yeah, I think we'll need the app queue for this anyway!

It would depend on exactly what functionality we need, but I don't
think this is accurate. We can chain futures such that we wait for the
previous client op to complete, then wait on a timer, if we are just
limiting it to an absolute IOP rate. dmclock is a little harder but we
can also do futures that are satisfied by the completion of a single
"child" future, which would let us combine many different conditions
together and probably build that model.
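
As a rough illustration of the simple rate-limit case only (not
dmclock), the sketch below chains each client op behind the previous
one and then a timer. Again this is Python asyncio standing in for
Seastar; RateLimitedClient, submit and min_interval are made-up names,
and the timer here starts when the previous op completes, which is just
one possible policy.

```python
import asyncio


class RateLimitedClient:
    """Hypothetical per-client chain: previous op -> timer -> next op."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between consecutive ops
        self._tail = None                 # task for the most recent op

    def submit(self, op):
        # Must be called from inside a running event loop.
        prev = self._tail

        async def chained():
            if prev is not None:
                # Wait for the previous op to finish, whatever its outcome,
                # then wait on the timer before starting this one.
                await asyncio.wait([prev])
                await asyncio.sleep(self.min_interval)
            return await op()

        self._tail = asyncio.create_task(chained())
        return self._tail


async def demo(n_ops=3, min_interval=0.01):
    loop = asyncio.get_running_loop()
    client = RateLimitedClient(min_interval)
    starts = []  # (op index, start time) in the order ops actually ran

    def make_op(i):
        async def op():
            starts.append((i, loop.time()))
            return i
        return op

    tasks = [client.submit(make_op(i)) for i in range(n_ops)]
    results = await asyncio.gather(*tasks)
    return results, starts


if __name__ == "__main__":
    results, starts = asyncio.run(demo())
    print(results, [i for i, _ in starts])
```

No queue object appears anywhere; ordering and pacing fall out of the
chain itself. The "satisfied by any one child" combinator mentioned
above would correspond to something like asyncio.wait() with
return_when=asyncio.FIRST_COMPLETED in this model.
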
-Greg