Re: osd reservation woes

On Tue, Mar 13, 2018 at 8:19 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Current situation:
>
> Currently each OSD has two AsyncReserver instances, local_reserver and
> remote_reserver.  For any reservation for PG background work (e.g.,
> backfill or recovery), we first reserve from the local_reserver, and once
> we get that slot we reserve from the remote_reserver on all non-primary
> osds.  Then we do the work.
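
To make that concrete, here is a minimal sketch of the two-phase flow
(local slot first, then the remote slots in a fixed order).  The Reserver
type and its names below are made up for illustration and are not the real
AsyncReserver API:

    // Simplified stand-in for a per-OSD reserver: a fixed number of slots
    // plus a FIFO of callbacks waiting for one to free up.
    #include <functional>
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Reserver {
      unsigned max_slots;
      unsigned in_use = 0;
      std::queue<std::function<void()>> waiting;

      explicit Reserver(unsigned max) : max_slots(max) {}

      void request(std::function<void()> on_reserved) {
        if (in_use < max_slots) {
          ++in_use;
          on_reserved();                         // grant immediately
        } else {
          waiting.push(std::move(on_reserved));  // wait for a free slot
        }
      }

      void release() {
        if (!waiting.empty()) {
          auto next = std::move(waiting.front());
          waiting.pop();
          next();                                // hand the slot to the next waiter
        } else {
          --in_use;
        }
      }
    };

    int main() {
      Reserver local_reserver(1);                              // primary's slot
      std::vector<Reserver> remote_reservers(2, Reserver(1));  // non-primary OSDs

      // Two-phase acquisition: the local slot first, then every remote slot,
      // in a fixed order so two PGs can't deadlock waiting on each other.
      local_reserver.request([&] {
        std::cout << "got local slot on primary\n";
        remote_reservers[0].request([&] {
          remote_reservers[1].request([&] {
            std::cout << "got all remote slots; start backfill/recovery\n";
            // ... do the work, then release everything ...
          });
        });
      });
      return 0;
    }
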
>
> There are a few problems with this approach:
>
> - The effective minimum for background work is always 2, not 1, since we
> always have at least one "local" slot and one "remote" slot, and ideally
> we'd like to have a single background task running per osd.
>
> - The reservations are taken in a strict order to prevent deadlock,
> which means that we often spend lots of time waiting on a single busy OSD
> when there is other work we could be doing that is not blocked.  Having
> some backoff/retry behavior should allow us to make better progress
> overall.
>
> - The reservers are super simple and the PG state machines that deal
> with the two reservers are super complicated and hard to follow.  We keep
> uncovering bugs where the combination of preemption (locally or remotely)
> races with recovery completion and leads to an unexpected state machine
> event.  Latest examples are
>         https://github.com/ceph/ceph/pull/20837
> and
>         http://tracker.ceph.com/issues/22902

So are these issues all a result of the newish preemption mechanisms?

>
> - Scrub uses a totally independent implementation for its scheduling that
> *mostly* plays nice with these reservations (but not completely).

Can you discuss the boundaries here? I know they use the same basic
Reservers mechanism but I don't recall how they coordinate and don't
understand what might not play nicely.

>
>
> I think longer term we want a different structure: a reservation/locking
> service on each OSD that will handle the reservation/locking both locally
> and remotely and provide a single set of notification events (reserved,
> preempted) to the PG state machine.  This would eliminate the complexity
> in the state machines and vastly simplify that code, making it easier to
> understand and follow.  It would also put all of the reservation behavior
> (including the messaging between peers) in a single module where it can be
> more easily understood.

Hmm, I'm not quite sure how this would work. It seems like the
reservation state really needs to coordinate pretty closely with the
PG. eg, we have to decrease our priority if we go from being
undersized to merely misplaced. Or maybe you just mean that things are
overly complicated because there *isn't* a defined interface for how
the PGs interact between each other?
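
For the sake of discussion, here is one guess at what the interface to such
a service could look like: a single request covering the local and remote
slots, exactly two notification events (reserved, preempted), plus a
priority-update hook for cases like the undersized-to-misplaced transition
above.  All of the names and types here are hypothetical; this is only a
strawman to anchor the discussion:

    #include <cstdint>
    #include <functional>
    #include <vector>

    using pg_id_t  = uint64_t;
    using osd_id_t = int32_t;

    struct ReservationRequest {
      pg_id_t pgid;
      std::vector<osd_id_t> acting;  // primary plus non-primary OSDs to reserve on
      unsigned priority;             // e.g. higher while the PG is undersized
    };

    class ReservationService {
    public:
      using Handle = uint64_t;

      // Ask for the local and all remote slots in one call; the two callbacks
      // replace the separate local/remote events the PG state machines juggle today.
      virtual Handle request(const ReservationRequest& req,
                             std::function<void()> on_reserved,
                             std::function<void()> on_preempted) = 0;

      // Raise or lower the priority in place, e.g. when the PG goes from
      // undersized to merely misplaced and no longer needs to jump the queue.
      virtual void update_priority(Handle h, unsigned new_priority) = 0;

      // Drop the reservation when the work finishes or is no longer needed.
      virtual void cancel(Handle h) = 0;

      virtual ~ReservationService() = default;
    };
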

>
> It would be an incompatible change because we'll need a new, generic
> 'reservation' message that is used by the new reserver service.  If we
> implement backoff/retry, then we can also drop the ad hoc scrub
> reservation code as well.
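
As an aside, a rough sketch of what backoff/retry could look like on the
requesting side, assuming the new generic reservation message can simply be
answered with "busy, try again later".  The peer is faked out and every name
is invented; this only shows the retry loop, not a real implementation:

    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <thread>

    struct RemotePeer {
      int osd;
      int busy_attempts;   // pretend the peer rejects us this many times

      bool try_reserve() {
        if (busy_attempts > 0) {
          --busy_attempts;
          return false;    // peer replied "busy, try again later"
        }
        return true;       // peer granted the slot
      }
    };

    bool reserve_with_backoff(RemotePeer& peer,
                              std::chrono::milliseconds initial,
                              std::chrono::milliseconds cap,
                              int max_attempts) {
      auto delay = initial;
      for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (peer.try_reserve()) {
          std::cout << "osd." << peer.osd << " reserved on attempt "
                    << attempt << "\n";
          return true;
        }
        std::cout << "osd." << peer.osd << " busy; backing off "
                  << delay.count() << "ms\n";
        std::this_thread::sleep_for(delay);
        delay = std::min(delay * 2, cap);  // exponential backoff, capped
      }
      return false;  // give up for now; the caller can work on other PGs and retry later
    }

    int main() {
      RemotePeer peer{3, 2};  // osd.3 rejects the first two attempts
      reserve_with_backoff(peer, std::chrono::milliseconds(10),
                           std::chrono::milliseconds(80), 5);
      return 0;
    }
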
>
> This is a bigger project to implement, but I think it will resolve all of
> the little bugs we're seeing in the current state machines.  I'm hoping
> that we can patch those on an individual basis to get us by long enough to
> do the larger cleanup...?
>
> If anyone is interested in tackling this problem, that would be awesome!

Is there a feature ticket or a Trello item for this? ;)
-Greg