Hi Sam,

We discussed this briefly on IRC; I think it might be better to recap with an email.

Currently we schedule backfill/recovery based on how degraded the PG is, with a factor distinguishing recovery vs. backfill (recovery always has higher priority). The degradation level of a PG is calculated as: {expected_pool_size} - {acting_set_size}.

I think there are two issues with the current approach:

1. The current {acting_set_size} might not capture the degradation level over the past intervals. For example, take two PGs, 1.0 and 1.1, in an Erasure Coding pool with 8 data and 3 parity chunks (expected size 11):
1.1 At t1, PG 1.0's acting set size becomes 8 while PG 1.1's acting set size is still 11.
1.2 At t2, PG 1.0's acting set size becomes 10 while PG 1.1's acting set size drops to 9.
1.3 At t3, we start recovering (e.g. mark out some OSDs).
With the current algorithm, PG 1.1 will recover first and then PG 1.0 (if the concurrency is configured as 1). However, from a data durability perspective, the data written to PG 1.0 between t1 and t2 is more degraded and at greater risk, since it was written while only 8 chunks were in the acting set (a small sketch at the end of this mail walks through the numbers).

2. The algorithm does not take EC vs. replication into account (nor the EC profile), which might also be important to consider for data durability.

Is my understanding correct here?

Thanks,
Guang
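
P.S. A rough sketch of the numbers above, assuming the simplified priority model described in this mail (the per-PG values and the "minimum acting set size since the writes" bookkeeping are hypothetical, not the actual Ceph code):

EXPECTED_SIZE = 11  # EC 8+3

# acting set sizes as seen at t3, per the scenario in issue 1
acting_set_size = {"1.0": 10, "1.1": 9}

# smallest acting set size each PG had while the t1..t2 data was written
min_acting_set_during_writes = {"1.0": 8, "1.1": 9}

def current_degradation(pg):
    # today's metric: only the acting set size right now matters
    return EXPECTED_SIZE - acting_set_size[pg]

def history_aware_degradation(pg):
    # what issue 1 argues for: account for how degraded the data
    # written in past intervals still is
    return EXPECTED_SIZE - min_acting_set_during_writes[pg]

for pg in ("1.0", "1.1"):
    print(pg, current_degradation(pg), history_aware_degradation(pg))

# current metric:       PG 1.0 -> 1, PG 1.1 -> 2  => PG 1.1 recovers first
# history-aware metric: PG 1.0 -> 3, PG 1.1 -> 2  => PG 1.0 should go first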