osd reservation woes

Current situation:

Currently each OSD has two AsyncReserver instances, local_reserver and 
remote_reserver.  For any reservation for PG background work (e.g., 
backfill or recovery), we first reserve from the local_reserver, and once 
we get that slot we reserve from the remote_reserver on each non-primary 
OSD.  Then we do the work.
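To make the ordering concrete, here is a toy sketch of the two-phase flow.  SimpleReserver, reserve_remotes, and start_background_work are illustrative names, not the real AsyncReserver API; grants are synchronous here, whereas in reality each remote grant arrives via a message from the peer OSD:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <vector>

// Toy stand-in for AsyncReserver: grants up to max_allowed slots and
// queues further requests until a slot is released.  The single shared
// queue (no priorities) is a simplification for illustration.
struct SimpleReserver {
  unsigned max_allowed;
  unsigned in_use = 0;
  std::queue<std::function<void()>> waiting;

  explicit SimpleReserver(unsigned max) : max_allowed(max) {}

  void request(std::function<void()> on_grant) {
    if (in_use < max_allowed) {
      ++in_use;
      on_grant();
    } else {
      waiting.push(std::move(on_grant));
    }
  }

  void release() {
    --in_use;
    if (!waiting.empty()) {
      ++in_use;
      auto next = std::move(waiting.front());
      waiting.pop();
      next();
    }
  }
};

using Remotes = std::shared_ptr<std::vector<SimpleReserver*>>;

// Phase 2: take remote slots one peer at a time, in a fixed order.
void reserve_remotes(Remotes remotes, size_t i,
                     std::function<void()> do_work) {
  if (i == remotes->size()) {
    do_work();  // all slots held: start backfill/recovery
    return;
  }
  (*remotes)[i]->request([remotes, i, do_work] {
    reserve_remotes(remotes, i + 1, do_work);  // chain to the next peer
  });
}

// Phase 1: take the primary's local slot first, then chain to remotes.
void start_background_work(SimpleReserver& local, Remotes remotes,
                           std::function<void()> do_work) {
  local.request([remotes, do_work] {
    reserve_remotes(remotes, 0, do_work);
  });
}
```

Note that even with max_allowed = 1 on both reservers, an OSD that is primary for one PG and a replica for another can end up running two background tasks at once, which is the first problem described below.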

There are a few problems with this approach:

- The effective minimum for background work is always 2, not 1, since we 
always have at least one "local" slot and one "remote" slot, while ideally 
we'd like to be able to run a single background task per OSD.

- The reservations are taken in a strict order to prevent deadlock, 
which means that we often spend lots of time waiting on a single busy OSD 
when there is other work we could be doing that is not blocked.  Having 
some backoff/retry behavior should allow us to make better progress 
overall.

- The reservers are super simple, but the PG state machines that deal 
with the two reservers are super complicated and hard to follow.  We keep 
uncovering bugs where the combination of preemption (locally or remotely) 
races with recovery completion and leads to an unexpected state machine 
event.  Latest examples are
	https://github.com/ceph/ceph/pull/20837
and
	http://tracker.ceph.com/issues/22902

- Scrub uses a totally independent implementation for its scheduling that 
*mostly* plays nice with these reservations (but not completely).
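To illustrate the backoff/retry idea from the second point: instead of queueing in strict order behind a busy OSD, a PG could try to take every slot non-blockingly and, if any peer is busy, drop what it holds and retry later so other PGs can proceed in the meantime.  This is a sketch under assumed names (Slot, try_acquire, try_reserve_all), not the current code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical non-blocking reserver slot: try_acquire() either takes a
// slot immediately or fails, rather than queueing behind a busy OSD.
struct Slot {
  unsigned max_allowed;
  unsigned in_use = 0;
  bool try_acquire() {
    if (in_use < max_allowed) { ++in_use; return true; }
    return false;
  }
  void release() { --in_use; }
};

// Attempt every peer; if any is busy, release what we already hold and
// report failure so the caller can schedule work for a different PG and
// retry this one later.  Releasing everything on failure is what avoids
// the deadlock that the strict ordering currently prevents.
bool try_reserve_all(std::vector<Slot*>& peers) {
  std::vector<Slot*> held;
  for (auto* p : peers) {
    if (p->try_acquire()) {
      held.push_back(p);
    } else {
      for (auto* h : held) h->release();  // back off completely
      return false;
    }
  }
  return true;
}
```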


I think longer term we want a different structure: a reservation/locking 
service on each OSD that will handle the reservation/locking both locally 
and remotely and provide a single set of notification events (reserved, 
preempted) to the PG state machine.  This would eliminate the complexity 
in the state machines and vastly simplify that code, making it easier to 
understand and follow.  It would also put all of the reservation behavior 
(including the messaging between peers) in a single module where it can be 
more easily understood.
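A toy model of what the PG-facing side of such a service might look like, with all names (UnifiedReserver, Event, Handler) hypothetical: the PG makes one request and receives exactly two kinds of events back, regardless of whether the contention happened locally or on a peer.

```cpp
#include <cassert>
#include <functional>

// The single set of notification events the PG state machine would see.
enum class Event { Reserved, Preempted };
using Handler = std::function<void(Event)>;

// Toy single-slot service: one grant at a time; a higher-priority
// request preempts the current holder, which sees exactly one Preempted
// event.  A real implementation would also manage peer messaging,
// multiple slots, queues, and retry behind this same interface.
class UnifiedReserver {
  int cur_prio = -1;
  Handler cur_handler;
  bool held = false;
public:
  void request(int prio, Handler h) {
    if (!held) {
      held = true;
      cur_prio = prio;
      cur_handler = std::move(h);
      cur_handler(Event::Reserved);
    } else if (prio > cur_prio) {
      cur_handler(Event::Preempted);  // one event to the loser
      cur_prio = prio;
      cur_handler = std::move(h);
      cur_handler(Event::Reserved);
    }
    // a real service would queue lower-priority requests here
  }
  void release() {
    held = false;
    cur_prio = -1;
  }
};
```

The point of the sketch is the shape of the interface, not the policy: the state machine only ever handles Reserved and Preempted, so the races between preemption and completion live in one module instead of in every PG state machine.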

It would be an incompatible change because we'll need a new, generic 
'reservation' message that is used by the new reserver service.  If we 
implement backoff/retry, then we can drop the ad hoc scrub reservation 
code as well.

This is a bigger project to implement, but I think it will resolve all of 
the little bugs we're seeing in the current state machines.  I'm hoping 
that we can patch those on an individual basis to get us by long enough to 
do the larger cleanup...?

If anyone is interested in tackling this problem, that would be awesome!

sage