Current situation:

Each OSD currently has two AsyncReserver instances, local_reserver and remote_reserver. For any reservation for PG background work (e.g., backfill or recovery), we first reserve from the local_reserver, and once we get that slot we reserve from the remote_reserver on all non-primary OSDs. Then we do the work.

There are a few problems with this approach:

- The effective minimum for background work is always 2, not 1, since we always hold at least one "local" slot and one "remote" slot, and ideally we'd like to be able to run a single background task per OSD.

- The reservations are taken in a strict order to prevent deadlock, which means we often spend a long time waiting on a single busy OSD when there is other, unblocked work we could be doing. Some backoff/retry behavior should let us make better progress overall.

- The reservers themselves are super simple, but the PG state machines that deal with the two reservers are super complicated and hard to follow. We keep uncovering bugs where preemption (local or remote) races with recovery completion and leads to an unexpected state machine event. The latest examples are https://github.com/ceph/ceph/pull/20837 and http://tracker.ceph.com/issues/22902

- Scrub uses a totally independent implementation for its scheduling that *mostly* plays nice with these reservations (but not completely).

Longer term, I think we want a different structure: a reservation/locking service on each OSD that handles the reservation/locking both locally and remotely and presents a single set of notification events (reserved, preempted) to the PG state machine. This would eliminate the complexity in the state machines and vastly simplify that code, making it easier to understand and follow. It would also put all of the reservation behavior (including the messaging between peers) in a single module where it can be more easily understood. It would be an incompatible change, because we'll need a new, generic 'reservation' message for the new reserver service to use. If we implement backoff/retry, then we can drop the ad hoc scrub reservation code as well.

This is a bigger project to implement, but I think it will resolve all of the little bugs we're seeing in the current state machines. I'm hoping that we can patch those on an individual basis to get us by long enough to do the larger cleanup...?

If anyone is interested in tackling this problem, that would be awesome! I've appended a few rough sketches below to make the above concrete.

sage
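
First, for anyone not steeped in this code, here's roughly the shape of the current two-phase dance. This is a toy model, not the real code: Reserver, reserver_on_osd, start_backfill, etc. are all invented for illustration, the real AsyncReserver also takes a priority and a preemption callback, and the remote requests actually travel as messages (MBackfillReserve/MRecoveryReserve) between OSDs rather than through a shared map.

  #include <functional>
  #include <iostream>
  #include <map>
  #include <vector>

  // Toy stand-in for AsyncReserver: grants immediately.  The real one
  // queues requests by priority and can preempt lower-priority holders.
  struct Reserver {
    void request(std::function<void()> on_granted) { on_granted(); }
  };

  // One relevant reserver per OSD (local on the primary, remote on the
  // peers), collapsed into a single map for brevity.
  std::map<int, Reserver> reserver_on_osd;

  // Take remote slots one peer at a time, in a strict order, to avoid
  // deadlock -- which is why one busy peer stalls everything behind it.
  void reserve_remotes(std::vector<int> peers, size_t i,
                       std::function<void()> on_all_granted) {
    if (i == peers.size()) {
      on_all_granted();
      return;
    }
    reserver_on_osd[peers[i]].request([=] {
      reserve_remotes(peers, i + 1, on_all_granted);
    });
  }

  void start_backfill(int primary, std::vector<int> non_primary) {
    // Phase 1: local slot on the primary.  Phase 2: a remote slot on
    // each non-primary OSD.  Hence the effective minimum of 2 slots.
    reserver_on_osd[primary].request([=] {
      reserve_remotes(non_primary, 0,
                      [] { std::cout << "all reserved, backfilling\n"; });
    });
  }

  int main() { start_backfill(0, {1, 2}); }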
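
Second, the backoff/retry idea in the simplest possible terms. Again the names (try_reserve_peer, reserve_with_backoff) are invented, and a real implementation would hook into the OSD's existing timer machinery instead of sleeping in place:

  #include <algorithm>
  #include <chrono>
  #include <iostream>
  #include <thread>

  // Pretend the peer is busy for the first few attempts.
  static int attempts = 0;
  bool try_reserve_peer(int /*osd*/) { return ++attempts > 3; }

  void reserve_with_backoff(int osd) {
    auto delay = std::chrono::milliseconds(100);
    while (!try_reserve_peer(osd)) {
      // Instead of blocking the whole ordered chain on this one busy
      // peer, back off and retry later, leaving our other slots free
      // for unblocked work.  Cap the delay so we never starve entirely.
      std::this_thread::sleep_for(delay);
      delay = std::min(delay * 2, std::chrono::milliseconds(5000));
    }
    std::cout << "reserved slot on osd." << osd << "\n";
  }

  int main() { reserve_with_backoff(3); }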
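
Finally, one possible shape for the unified service's interface. Everything here (ReservationService, pg_id_t, Callbacks) is made up, and the bodies are stubs; the point is just that the PG state machine would only ever see two events:

  #include <functional>
  #include <iostream>
  #include <set>

  struct pg_id_t { int pool; int seed; };  // invented placeholder

  struct ReservationService {
    struct Callbacks {
      std::function<void()> on_reserved;   // local + all remote slots held
      std::function<void()> on_preempted;  // a slot was revoked; stop work
    };

    // Acquire the local slot plus a slot on each peer on the PG's
    // behalf.  The service would own the ordering/backoff logic and
    // the new generic reservation message to peers.
    void request(pg_id_t, std::set<int> /*peer_osds*/,
                 unsigned /*priority*/, Callbacks cbs) {
      cbs.on_reserved();  // stub: pretend every slot was free
    }

    // Release anything held or queued for this PG (e.g., on interval
    // change or when the work completes).
    void cancel(pg_id_t) {}
  };

  int main() {
    ReservationService svc;
    svc.request({1, 42}, {2, 3}, /*priority=*/10,
                {[] { std::cout << "reserved: start backfill\n"; },
                 [] { std::cout << "preempted: requeue\n"; }});
  }

The point of this shape is that ordering, peer messaging, retry, and preemption bookkeeping all live behind request()/cancel(), so the state machines shrink to handling reserved/preempted.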