Question about recovery vs backfill priorities

Bartłomiej Święcki <bartlomiej.swiecki@xxxxxxxxxxxx> · Thu, 1 Dec 2016 11:21:35 +0100

Hi,

We're currently being hit by an issue with cluster recovery. The cluster
was significantly extended (~50% new OSDs) and started recovery.
During recovery there was a hardware failure and we ended up with some
PGs in the peered state with size < min_size (inactive).
Those peered PGs are waiting for backfill, but the cluster still prefers
to recover the recovery_wait PGs first. In our case it could take a few
hours before all recovery is finished (we're pushing recovery speed to
its limits to keep the downtime as short as possible). The peered PGs
stay blocked during this time and the whole cluster struggles to operate
at a reasonable level.

We're running hammer 0.94.6 there, and from the code it looks like
recovery will always have higher priority than backfill (jewel seems
similar). The documentation only says that log-based recovery must
finish before backfills. Is this requirement needed for data consistency
or for something else?

Ideally we'd like the PGs to be handled in this order:

1. undersized inactive (size < min_size) recovery_wait
2. undersized inactive (size < min_size) wait_backfill
3. degraded recovery_wait
4. degraded wait_backfill
5. remapped wait_backfill

Changing the priority calculation doesn't seem to be that hard, but
would it end up with inconsistent data?
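
To make the proposed order concrete, here is a minimal Python sketch of the ranking we have in mind. It is only an illustration, not how Ceph computes priorities in hammer or jewel: the `PG` class, its fields, and the `rank` and `schedule` functions are all hypothetical names, and placing any combination not listed above (for example remapped recovery_wait) last is our own assumption.

```python
from dataclasses import dataclass

# Hypothetical rank table: earlier entries are scheduled first.
# The pairs follow the order proposed in this mail.
ORDER = [
    ("inactive", "recovery_wait"),
    ("inactive", "wait_backfill"),
    ("degraded", "recovery_wait"),
    ("degraded", "wait_backfill"),
    ("remapped", "wait_backfill"),
]


@dataclass(frozen=True)
class PG:
    """Illustrative PG summary; not a real Ceph structure."""
    pgid: str
    acting_size: int   # number of replicas currently available
    min_size: int      # pool min_size
    degraded: bool
    state: str         # "recovery_wait" or "wait_backfill"


def health_class(pg):
    # size < min_size means the PG cannot serve I/O (inactive).
    if pg.acting_size < pg.min_size:
        return "inactive"
    if pg.degraded:
        return "degraded"
    return "remapped"


def rank(pg):
    # Lower rank is handled first; unlisted combinations go last (assumption).
    key = (health_class(pg), pg.state)
    return ORDER.index(key) if key in ORDER else len(ORDER)


def schedule(pgs):
    # sorted() is stable, so PGs with equal rank keep their input order.
    return sorted(pgs, key=rank)
```

With such a ranking, an inactive PG waiting for backfill would be picked before a merely degraded PG waiting for log-based recovery, which is the behaviour we are asking about.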

Bartek
