A common complaint is that recovery/backfill/rebalancing has a high impact. That isn't news. What I realized this week after hearing more operators describe their workaround is that everybody's workaround is roughly the same: make small changes to the crush map so that only a small number of PGs are backfilling at a time.

In retrospect it seems obvious, but the problem is that our backfill throttling is per-OSD: the "slowest" we can go is 1 backfilling PG per OSD. (Actually, 2: one primary and one replica, due to separate reservation thresholds to avoid deadlock.) That means that every OSD is impacted. Backfilling fewer PGs at a time doesn't improve the recovery vs client scheduling on the OSDs that are busy, but it touches fewer PGs and fewer client IOs, so the net observed impact is smaller.

Anyway, in short, I think we need to be able to set a *global* threshold of "no more than X% of OSDs should be backfilling at a time," which is impossible given the current reservation approach.

This could be done naively by having OSDs reserve a slot via the mon or mgr. If we only did it for backfill the impact should be minimal (those are big, slow, long-running operations already). I think you can *almost* do it cleverly by inferring the set of PGs that have to backfill from pg_temp, but that doesn't take any priority or stuck PGs into consideration. Anyway, the naive thing probably isn't so bad:

1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with one or more backfilling PGs).

2) For the first step of the backfill (recovery?) reservation, OSDs ask the mgr for a reservation slot. The reservation is (pgid, interval epoch) so that the mgr can throw out the reservation without needing an explicit cancellation if there is an interval change.

3) mgr grants as many reservations as it can without letting (backfilling + grants) exceed whatever the max is.

We could set the max with a global tunable like max_osd_backfilling_ratio = .3 so that only 30% of the OSDs can be backfilling at once? (Rough sketch of the mgr-side bookkeeping below.)

sage
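
Something like this is what I'm imagining for the mgr-side bookkeeping. Pure sketch: none of these names or interfaces exist, and the real thing would hang off PGMap and the mgr's OSD sessions rather than a standalone class.

  # hypothetical sketch of the mgr-side reservation bookkeeping; not real
  # Ceph code, just the logic from steps 1-3 above
  from collections import defaultdict

  class BackfillReservationTracker:
      def __init__(self, num_osds, max_osd_backfilling_ratio=0.3):
          self.num_osds = num_osds
          self.max_ratio = max_osd_backfilling_ratio
          # osd id -> set of (pgid, interval_epoch) reservations we granted
          self.grants = defaultdict(set)
          # osd id -> number of PGs already observed backfilling (from PGMap)
          self.backfilling = defaultdict(int)

      def update_from_pgmap(self, backfilling_per_osd):
          """Step 1: PGMap tells us how many PGs are backfilling per OSD."""
          self.backfilling = defaultdict(int, backfilling_per_osd)

      def _busy_osds(self):
          # OSDs with at least one backfilling PG or an outstanding grant
          return ({osd for osd, n in self.backfilling.items() if n > 0} |
                  {osd for osd, g in self.grants.items() if g})

      def request(self, pgid, interval_epoch, osds):
          """Steps 2/3: an OSD asks for a slot for (pgid, interval_epoch).

          'osds' is the set of OSDs the backfill would touch (primary plus
          backfill targets).  Grant only if that keeps the number of busy
          OSDs within max_osd_backfilling_ratio of the cluster.
          """
          limit = int(self.num_osds * self.max_ratio)
          would_be_busy = self._busy_osds() | set(osds)
          if len(would_be_busy) > limit:
              return False          # caller retries later
          for osd in osds:
              self.grants[osd].add((pgid, interval_epoch))
          return True

      def on_interval_change(self, pgid, old_epoch):
          """Drop stale grants without an explicit cancellation message."""
          for g in self.grants.values():
              g.discard((pgid, old_epoch))

The (pgid, interval epoch) key is what lets the mgr garbage-collect grants on its own: when it sees a newer interval for that PG it just forgets the old reservation instead of waiting for a cancel from the OSD.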
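
And a toy walk-through of the same sketch, assuming a 100-OSD cluster and the 30% cap (pgid/epoch values are made up):

  # two OSDs already backfilling per PGMap; asking for OSDs 4 and 7 keeps
  # the busy set at 4 OSDs, well under the 30-OSD cap, so it is granted
  t = BackfillReservationTracker(num_osds=100, max_osd_backfilling_ratio=0.3)
  t.update_from_pgmap({0: 1, 1: 1})
  print(t.request("2.3f", 410, osds=[4, 7]))   # True
  t.on_interval_change("2.3f", 410)            # map change frees the slot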