Re: osd reservation woes

On Tue, 13 Mar 2018, Gregory Farnum wrote:
> On Tue, Mar 13, 2018 at 8:19 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > Current situation:
> >
> > Currently each OSD has two AsyncReserver instances, local_reserver and
> > remote_reserver.  For any reservation for PG background work (e.g.,
> > backfill or recovery), we first reserve from the local_reserver, and once
> > we get that slot we reserve from the remote_reserver on all non-primary
> > OSDs.  Then we do the work.
> >
> > There are a few problems with this approach:
> >
> > - The effective minimum for background work is always 2, not 1, since we
> > always have at least one "local" slot and one "remote" slot, and ideally
> > we'd like to have a single background task running per OSD.
> >
> > - The reservations are taken in a strict order to prevent deadlock,
> > which means that we often spend lots of time waiting on a single busy OSD
> > when there is other work we could be doing that is not blocked.  Having
> > some backoff/retry behavior should allow us to make better progress
> > overall.
> >
> > - The reservers are super simple and the PG state machines that deal
> > with the two reservers are super complicated and hard to follow.  We keep
> > uncovering bugs where the combination of preemption (locally or remotely)
> > races with recovery completion and leads to an unexpected state machine
> > event.  Latest examples are
> >         https://github.com/ceph/ceph/pull/20837
> > and
> >         http://tracker.ceph.com/issues/22902
> 
> So are these issues all a result of the newish preemption mechanisms?

Pretty much, yeah.  :(
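
For anyone following along, the current flow described above boils down to 
something like the toy model below.  The Reserver here is a simplified 
stand-in for AsyncReserver<spg_t> (FIFO queueing instead of priorities and 
preemption), and all names and structure are illustrative only, not the 
real OSD/PG code:

  #include <cstddef>
  #include <functional>
  #include <iostream>
  #include <map>
  #include <queue>
  #include <vector>

  // Toy stand-in for AsyncReserver<spg_t>: grants up to max_slots
  // reservations and queues the rest FIFO (the real one is a priority
  // queue with preemption).
  struct Reserver {
    unsigned max_slots;
    unsigned in_use = 0;
    std::queue<std::function<void()>> waiting;

    explicit Reserver(unsigned slots) : max_slots(slots) {}

    void request(std::function<void()> on_grant) {
      if (in_use < max_slots) {
        ++in_use;
        on_grant();
      } else {
        waiting.push(std::move(on_grant));
      }
    }

    void release() {
      --in_use;
      if (!waiting.empty()) {
        auto next = std::move(waiting.front());
        waiting.pop();
        ++in_use;
        next();
      }
    }
  };

  int main() {
    // In the real OSD each daemon has *both* a local_reserver and a
    // remote_reserver, which is why the effective per-OSD minimum is 2
    // background tasks, not 1.  Here we model a primary plus two peers.
    Reserver local_reserver(1);
    std::map<int, Reserver> remote_reservers;
    remote_reservers.emplace(1, Reserver(1));
    remote_reservers.emplace(2, Reserver(1));

    std::vector<int> peers = {1, 2};

    // Step 2: once the local slot is held, take a remote slot from each
    // non-primary OSD one at a time, in a fixed order, to avoid deadlock.
    // A single busy peer stalls the whole chain.
    std::function<void(std::size_t)> reserve_remote = [&](std::size_t i) {
      if (i == peers.size()) {
        // Step 3: every reservation is held; only now does the
        // backfill/recovery work actually start.
        std::cout << "all slots held; start background work\n";
        return;
      }
      int osd = peers[i];
      remote_reservers.at(osd).request([&, i, osd] {
        std::cout << "remote slot granted by osd." << osd << "\n";
        reserve_remote(i + 1);
      });
    };

    // Step 1: the primary reserves its own local slot first.
    local_reserver.request([&] {
      std::cout << "local slot granted on the primary\n";
      reserve_remote(0);
    });
  }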

> > - Scrub uses a totally independent implementation for its scheduling that
> > *mostly* plays nice with these reservations (but not completely).
> 
> Can you discuss the boundaries here? I know they use the same basic
> Reservers mechanism but I don't recall how they coordinate and don't
> understand what might not play nicely.

  if (!cct->_conf->osd_scrub_during_recovery && service.is_recovery_active()) {
    dout(20) << __func__ << " not scheduling scrubs due to active recovery" << dendl;
    return;
  }

in OSD::sched_scrub() and checks like

  if ((cct->_conf->osd_scrub_during_recovery || !osd->is_recovery_active()) &&
      osd->inc_scrubs_pending()) {
    scrubber.reserved = true;
  } else {
    dout(20) << __func__ << ": failed to reserve remotely" << dendl;
    scrubber.reserved = false;
  }

in PG::handle_scrub_reserve_request() is pretty much the extent of it.
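
In other words, scrub reservations are just a counter in OSDService 
(scrubs_pending/scrubs_active capped by osd_max_scrubs) plus the two 
checks above; they never touch the AsyncReservers.  Roughly this toy 
model, which is simplified and not the literal OSDService code:

  #include <mutex>

  // Toy model of the counter-based scrub slot tracking in OSDService
  // (simplified; the real code is inc_scrubs_pending()/dec_scrubs_pending()
  // and friends, guarded by sched_scrub_lock).
  struct ScrubSlots {
    std::mutex lock;
    int pending = 0;       // slots granted to peers but not yet scrubbing
    int active = 0;        // scrubs currently running
    const int max_scrubs;  // osd_max_scrubs

    explicit ScrubSlots(int max) : max_scrubs(max) {}

    // Equivalent of inc_scrubs_pending(): succeed only if another scrub
    // fits under the cap.  Note there is no priority, no preemption, and
    // no awareness of the recovery/backfill AsyncReservers at all -- the
    // only coupling is the osd_scrub_during_recovery /
    // is_recovery_active() checks quoted above.
    bool try_reserve() {
      std::lock_guard<std::mutex> l(lock);
      if (pending + active < max_scrubs) {
        ++pending;
        return true;
      }
      return false;
    }

    // Equivalent of dec_scrubs_pending(): give the slot back.
    void release() {
      std::lock_guard<std::mutex> l(lock);
      --pending;
    }
  };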


> > I think longer term we want a different structure: a reservation/locking
> > service on each OSD that will handle the reservation/locking both locally
> > and remotely and provide a single set of notification events (reserved,
> > preempted) to the PG state machine.  This would eliminate the complexity
> > in the state machines and vastly simplify that code, making it easier to
> > understand and follow.  It would also put all of the reservation behavior
> > (including the messaging between peers) in a single module where it can be
> > more easily understood.
> 
> Hmm, I'm not quite sure how this would work. It seems like the
> reservation state really needs to coordinate pretty closely with the
> PG. eg, we have to decrease our priority if we go from being
> undersized to merely misplaced. Or maybe you just mean that things are
> overly complicated because there *isn't* a defined interface for how
> the PGs interact between each other?

When the PG priority changes it will cancel its old request and request a 
new one.  That interface won't change.

What will change is that we'll get a callback when we get both the 
local+remote reservations instead of one when we get the local slot, and 
the PG won't implement the messages back and forth to request the remote 
reservations at all.  In the ReplicaActive portion of the PG state 
machine, we probably won't even know whether the PG is under 
recovery or not--those states are only there to drive the 
recovery reservations as far as I can tell (and don't enforce anything 
useful; for example, we can pull needed objects from stray replicas 
without reserving anything).
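
Roughly, I'm picturing the PG-facing side looking something like the 
sketch below -- placeholder names only, just to show the shape:

  #include <functional>

  // Placeholder sketch of the PG-facing interface; names are illustrative.
  struct ReservationRequest {
    unsigned priority;   // recomputed by the PG as its state changes
    // plus the set of non-primary OSDs to reserve on (elided here)

    // Fires once, only after the local slot *and* every remote slot have
    // been granted; the PG never sees the per-peer request/grant messages.
    std::function<void()> on_reserved;

    // Fires if a higher-priority request preempts us, locally or on any
    // peer; the service releases everything, the PG just stops its work.
    std::function<void()> on_preempted;
  };

  class ReservationService {
  public:
    virtual ~ReservationService() = default;

    // Queue a request for a PG; grants and preemptions come back via the
    // callbacks above.
    virtual void request(/* spg_t pgid, */ const ReservationRequest& req) = 0;

    // Called when the PG's priority changes (e.g. undersized -> merely
    // misplaced) or its work completes: drop the old request and, if
    // still needed, submit a new one at the new priority.
    virtual void cancel(/* spg_t pgid */) = 0;
  };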

I think the other piece that will change is that the reserver will have a 
key for the "what" that you're reserving.  Right now the spg_t is the only 
key, and the only users of the reserver are recovery/backfill so it's 
unambiguous what a grant or revoke means.  We'd add an enum for RECOVERY 
and SCRUB so that we know what got granted (and there'd be a separate 
priority for each).
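
For the key itself, something along these lines (again just a sketch; 
spg_t is the existing type, the rest are placeholders):

  #include <tuple>

  // Sketch only: today AsyncReserver is keyed by spg_t alone, and only
  // recovery/backfill use it, so a grant or preemption is unambiguous.
  // Adding a "what" to the key lets scrub share the same reserver.
  enum class ReservationType { RECOVERY, SCRUB };

  struct ReservationKey {
    spg_t pgid;              // which PG
    ReservationType what;    // what is being reserved for that PG

    bool operator<(const ReservationKey &rhs) const {
      return std::tie(pgid, what) < std::tie(rhs.pgid, rhs.what);
    }
  };

  // i.e. something like AsyncReserver<ReservationKey> instead of
  // AsyncReserver<spg_t>, with a separate priority carried per request.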

> > It would be an incompatible change because we'll need a new, generic
> > 'reservation' message that is used by the new reserver service.  If we
> > implement backoff/retry, then we can drop the ad hoc scrub
> > reservation code as well.
> >
> > This is a bigger project to implement, but I think it will resolve all of
> > the little bugs we're seeing in the current state machines.  I'm hoping
> > that we can patch those on an individual basis to get us by long enough to
> > do the larger cleanup...?
> >
> > If anyone is interested in tackling this problem, that would be awesome!
> 
> Is there a feature ticket or a Trello item for this? ;)

I've added a trello card:

	https://trello.com/c/ppJKaJeT/331-osd-refactor-reserver

sage
