> On 1 December 2016 at 23:10, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> Hi Bartek,
>
> On Thu, 1 Dec 2016, Bartłomiej Święcki wrote:
> > We're currently being hit by an issue with cluster recovery. The
> > cluster size has been significantly extended (~50% new OSDs) and
> > recovery started. During recovery there was a hardware failure and we
> > ended up with some PGs in peered state with size < min_size
> > (inactive). Those peered PGs are waiting for backfill, but the
> > cluster still prefers recovery of recovery_wait PGs - in our case it
> > could be a few hours before all recovery is finished (we're pushing
> > recovery to its limits to keep the downtime as short as possible).
> > Those peered PGs stay blocked during this time and the whole cluster
> > struggles to operate at a reasonable level.
> >
> > We're running hammer 0.94.6 there, and from the code it looks like
> > recovery will always have higher priority (jewel seems similar). The
> > documentation only says that log-based recovery must finish before
> > backfills. Is this requirement needed for data consistency, or
> > something else?
>
> For a given single PG, it will do its own log recovery (to bring acting
> OSDs fully up to date) before starting backfill, but between PGs there's
> no dependency.
>
> > Ideally we'd like it to be this order: undersized inactive (size <
> > min_size) recovery_wait => undersized inactive (size < min_size)
> > wait_backfill => degraded recovery_wait => degraded wait_backfill =>
> > remapped wait_backfill. Changing the priority calculation doesn't
> > seem to be that hard, but would it end up with inconsistent data?
>
> We could definitely improve this, yeah. The prioritization code is
> based around PG::get_{recovery,backfill}_priority() and the
> OSD_{RECOVERY,BACKFILL}_* #defines in PG.h. It currently assumes (log)
> recovery is always higher priority than backfill, but as you say we can
> do better. As long as everything maps into a 0..255 priority value it
> should be fine.
>
> Getting the undersized inactive PGs bumped to the top should be a
> simple tweak (just force a top-value priority in that case)...

Would be very, very welcome. Inactive PGs should always get priority,
imho. I find it difficult to explain to people why Ceph doesn't already
do this, and I've seen it happen multiple times.

Would this be just a fix in PG.cc, in get_backfill_priority() and
get_recovery_priority()? It seems so.

Wido

>
> sage
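
For illustration, here is a minimal sketch (not the actual Ceph code) of
the tweak Sage describes: force a top-value priority in the two
PG.cc priority methods whenever the PG is inactive (acting set below
min_size). The base values, the *_FORCED_INACTIVE constants, and the
PgState struct are all assumptions standing in for the real PG state;
only the 0..255 range constraint comes from the thread.

    // Sketch only: assumed priority bands within the required 0..255 range.
    static const unsigned OSD_BACKFILL_PRIORITY_BASE                 = 100;
    static const unsigned OSD_BACKFILL_DEGRADED_PRIORITY_BASE        = 140;
    static const unsigned OSD_RECOVERY_PRIORITY_BASE                 = 180;
    static const unsigned OSD_BACKFILL_PRIORITY_FORCED_INACTIVE      = 254;
    static const unsigned OSD_RECOVERY_PRIORITY_FORCED_INACTIVE      = 255;

    // Hypothetical stand-in for the PG state the real
    // PG::get_{recovery,backfill}_priority() methods would consult.
    struct PgState {
      unsigned acting_size;  // OSDs currently in the acting set
      unsigned min_size;     // pool min_size
      bool     degraded;
    };

    // Inactive PGs (acting < min_size, so client I/O is blocked) jump
    // straight to the top; everything else keeps the existing
    // recovery-over-backfill ordering.
    unsigned get_recovery_priority(const PgState& pg) {
      if (pg.acting_size < pg.min_size)
        return OSD_RECOVERY_PRIORITY_FORCED_INACTIVE;
      return OSD_RECOVERY_PRIORITY_BASE;
    }

    unsigned get_backfill_priority(const PgState& pg) {
      if (pg.acting_size < pg.min_size)
        return OSD_BACKFILL_PRIORITY_FORCED_INACTIVE;
      return pg.degraded ? OSD_BACKFILL_DEGRADED_PRIORITY_BASE
                         : OSD_BACKFILL_PRIORITY_BASE;
    }

With this shape the priorities come out in the order Bartłomiej asks
for: inactive recovery (255) > inactive backfill (254) > degraded
recovery (180) > degraded backfill (140) > plain/remapped backfill
(100), while recovery still outranks backfill for active PGs.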