I recently observed a problem on the lab cluster while doing a lot of rebalancing (filestore->bluestore conversion):

- lots of pgs in backfill_wait
- a few pgs that need pg log recovery, but these appear after backfills are already in progress, so they end up in backfill_wait too (confusing state name!)
- ongoing write activity extends pg logs for those pgs, but they cannot trim
- pg logs reach 5x-10x the max
- OSDs OOM

I think what is needed is for the recovery priority scheduling to allow preemption. If we are currently working on recovery/backfill for PG X, but PG Y appears with a higher priority, we should suspend work on X and switch to Y.

Piotr, I didn't look too closely at the forced recovery changes you folks recently did, but I'm guessing that it was added to address this sort of situation, right? Would a general solution that preempts and always works on the highest priority PG resolve the problem you've observed?

Thanks-
sage
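
To make the preemption idea concrete, here is a rough sketch in Python. It is a toy model, not Ceph code: the class name, method names, the single recovery slot, and the priority numbers are all hypothetical and chosen only for illustration. It shows a lower-priority PG being suspended when a higher-priority one shows up, then resumed once the higher-priority work finishes.

```python
import heapq
import itertools


class PreemptiveRecoveryScheduler:
    """Toy model (hypothetical, not Ceph's scheduler): a single recovery
    slot that is always held by the highest-priority PG known so far."""

    def __init__(self):
        self._waiting = []               # min-heap of (-priority, seq, pgid)
        self._seq = itertools.count()    # tie-breaker: FIFO among equal priorities
        self.active = None               # (priority, pgid) currently recovering
        self.events = []                 # trace of scheduling decisions

    def _enqueue(self, pgid, priority):
        heapq.heappush(self._waiting, (-priority, next(self._seq), pgid))

    def _start(self, pgid, priority):
        self.active = (priority, pgid)
        self.events.append(("start", pgid))

    def request(self, pgid, priority):
        """A PG asks for recovery/backfill at the given priority."""
        if self.active is None:
            self._start(pgid, priority)
        elif priority > self.active[0]:
            # Preempt: suspend the running PG, requeue it, switch to the new one.
            old_priority, old_pgid = self.active
            self.events.append(("suspend", old_pgid))
            self._enqueue(old_pgid, old_priority)
            self._start(pgid, priority)
        else:
            self._enqueue(pgid, priority)

    def finish(self):
        """The active PG completed; hand the slot to the best waiting PG."""
        if self.active is None:
            return
        self.events.append(("finish", self.active[1]))
        self.active = None
        if self._waiting:
            neg_priority, _, pgid = heapq.heappop(self._waiting)
            self._start(pgid, -neg_priority)


# Arbitrary priorities: X is a backfill, Y needs log recovery and ranks higher.
sched = PreemptiveRecoveryScheduler()
sched.request("X", 10)
sched.request("Y", 200)   # arrives later but preempts X
sched.finish()            # Y done, X resumes
print(sched.events)
```

In this model Y never waits behind X, so its pg log could be trimmed as soon as its recovery completes. A real implementation would also have to release and re-acquire local and remote reservations when suspending X, which the sketch ignores.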