Hi Bartek,

On Thu, 1 Dec 2016, Bartłomiej Święcki wrote:
> We're currently being hit by an issue with cluster recovery. The
> cluster was significantly extended (~50% new OSDs) and started
> recovery. During recovery there was a HW failure and we ended up with
> some PGs in peered state with size < min_size (inactive).
> Those peered PGs are waiting for backfill, but the cluster still
> prefers recovery of recovery_wait PGs - in our case it could take a
> few hours before all recovery finishes (we're pushing recovery to its
> limits to keep the downtime as short as possible). Those peered PGs
> are blocked during this time and the whole cluster struggles to
> operate at a reasonable level.
>
> We're running hammer 0.94.6 there, and from the code it looks like
> recovery always has higher priority (jewel seems similar). The
> documentation only says that log-based recovery must finish before
> backfill. Is this requirement needed for data consistency or
> something else?

For a given single PG, it will do its own log recovery (to bring the
acting OSDs fully up to date) before starting backfill, but between PGs
there's no dependency.

> Ideally we'd like the order to be: undersized inactive (size <
> min_size) recovery_wait => undersized inactive (size < min_size)
> wait_backfill => degraded recovery_wait => degraded wait_backfill =>
> remapped wait_backfill. Changing the priority calculation doesn't
> seem that hard, but would it end up with inconsistent data?

We could definitely improve this, yeah. The prioritization code is
based around PG::get_{recovery,backfill}_priority() and the
OSD_{RECOVERY,BACKFILL}_* #defines in PG.h. It currently assumes (log)
recovery is always higher priority than backfill, but as you say we can
do better. As long as everything maps into a 0..255 priority value it
should be fine.

Getting the undersized inactive PGs bumped to the top should be a
simple tweak (just force a top-value priority in that case)...
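Off the top of my head, something like the below. This is just a
sketch to show the shape of the change, not the actual PG.cc code: the
PgState struct is a made-up stand-in for the PG state flags, and the
constant values are illustrative (the real ones are the
OSD_{RECOVERY,BACKFILL}_* #defines):

  #include <algorithm>

  // Illustrative stand-ins for the OSD_{RECOVERY,BACKFILL}_* #defines
  // in PG.h -- values made up here, but everything has to land in the
  // 0..255 op priority range.
  static const unsigned OSD_BACKFILL_PRIORITY_BASE = 100;
  static const unsigned OSD_RECOVERY_PRIORITY_BASE = 180;
  static const unsigned OSD_RECOVERY_PRIORITY_MAX  = 255;

  // Made-up slice of PG state for the sketch; the real code would
  // test the PG state flags instead.
  struct PgState {
    bool inactive;    // peered, acting set below min_size (i/o blocked)
    bool undersized;  // fewer replicas than pool size
    bool degraded;    // objects missing or stale on acting OSDs
  };

  // Force inactive PGs to the top instead of a flat "recovery always
  // beats backfill".
  unsigned get_recovery_priority(const PgState& pg)
  {
    if (pg.inactive)
      return OSD_RECOVERY_PRIORITY_MAX;   // blocked i/o trumps all
    unsigned p = OSD_RECOVERY_PRIORITY_BASE;
    if (pg.undersized)
      ++p;                                // undersized before degraded
    return std::min(p, OSD_RECOVERY_PRIORITY_MAX);
  }

  // Same idea for backfill: an inactive PG in wait_backfill shouldn't
  // sit behind hours of ordinary recovery on other PGs.
  unsigned get_backfill_priority(const PgState& pg)
  {
    if (pg.inactive)
      return OSD_RECOVERY_PRIORITY_MAX - 1;  // just below inactive recovery
    unsigned p = OSD_BACKFILL_PRIORITY_BASE;
    if (pg.degraded)
      ++p;
    return std::min(p, OSD_RECOVERY_PRIORITY_MAX);
  }

That gives inactive recovery > inactive backfill > recovery > backfill,
which matches the ordering you describe. The only hard constraint is
the 0..255 mapping mentioned above, hence the clamp before returning.

sage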