Hi,
We're currently being hit by an issue with cluster recovery. The cluster
was recently expanded significantly (~50% new OSDs), which started
recovery. During recovery there was a HW failure and we ended up with
some PGs in the peered state with size < min_size (inactive).
Those peered PGs are waiting for backfill, but the cluster still prefers
to recover the recovery_wait PGs first. In our case it could take a few
hours before all recovery is finished (we're pushing the recovery
settings to their limits to keep the downtime as short as possible). The
peered PGs stay blocked during this time and the whole cluster struggles
to operate at a reasonable level.
We're running hammer 0.94.6 there, and from the code it looks like
recovery will always have higher priority than backfill (jewel seems
similar). The documentation only says that log-based recovery must finish
before backfills. Is this requirement needed for data consistency, or for
some other reason?
Ideally we'd like PGs to be handled in this order: undersized inactive
(size < min_size) recovery_wait => undersized inactive (size < min_size)
wait_backfill => degraded recovery_wait => degraded wait_backfill =>
remapped wait_backfill.
Changing the priority calculation doesn't seem to be that hard, but
would it end up with inconsistent data?
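To make the proposed ordering concrete, here is a rough sketch in Python.
It is not Ceph code: the PG class, its fields, and the priority_key and
recovery_order helpers are all made up for illustration, and it says
nothing about how the OSD actually computes priorities. It only shows the
ordering described above: inactive PGs first, then degraded, then
remapped, with log-based recovery ahead of backfill inside each group.

```python
from dataclasses import dataclass


# Hypothetical model of a PG for illustration; not Ceph's data structures.
@dataclass
class PG:
    pgid: str
    acting_size: int      # replicas currently in the acting set
    min_size: int         # pool min_size
    degraded: bool        # fewer copies than the pool size
    needs_backfill: bool  # True: wait_backfill, False: recovery_wait


def priority_key(pg: PG) -> tuple:
    """Return a sort key; smaller tuples are handled earlier."""
    if pg.acting_size < pg.min_size:
        tier = 0          # undersized and inactive: blocks client I/O
    elif pg.degraded:
        tier = 1          # active but missing copies
    else:
        tier = 2          # only remapped
    # Within a tier, log-based recovery goes before backfill.
    return (tier, 1 if pg.needs_backfill else 0)


def recovery_order(pgs):
    """Sort PGs into the order proposed in this mail."""
    return sorted(pgs, key=priority_key)
```

With this key, an inactive PG in wait_backfill sorts ahead of a degraded
PG in recovery_wait, which is the opposite of what we see today.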
Bartek