Hi Sam,

We discussed this briefly on IRC; I think it might be better to recap with an email.

Currently we schedule backfill/recovery based on how degraded the PG is, with a factor distinguishing recovery vs. backfill (recovery always has higher priority). The degradation level of a PG is calculated as: {expected_pool_size} - {acting_set_size}.

I think there are two issues with the current approach:

1. The current {acting_set_size} might not capture the degradation level over the past intervals. For example, take two PGs, 1.0 and 1.1, in an Erasure Coding pool with 8 data and 3 parity chunks (expected size 11):
1.1 At t1, PG 1.0's acting set size becomes 8 while PG 1.1's acting set size is still 11.
1.2 At t2, PG 1.0's acting set size becomes 10 while PG 1.1's acting set size drops to 9.
1.3 At t3, we start recovering (e.g. mark out some OSDs).
With the current algorithm, PG 1.1 will recover first and then PG 1.0 (if the concurrency is configured as 1). However, from a data durability perspective, the data written to PG 1.0 between t1 and t2 is more degraded and at greater risk, since it was written while only 8 chunks were in the acting set (a small sketch at the end of this mail walks through the numbers).

2. The algorithm does not take EC vs. replication into account (nor the EC profile), which might also be important to consider for data durability.

Is my understanding correct here?

Thanks,
Guang
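
P.S. A rough sketch of the numbers above, assuming the simplified priority model described in this mail (the per-PG values and the "minimum acting set size since the writes" bookkeeping are hypothetical, not the actual Ceph code):

EXPECTED_SIZE = 11  # EC 8+3

# acting set sizes as seen at t3, per the scenario in issue 1
acting_set_size = {"1.0": 10, "1.1": 9}

# smallest acting set size each PG had while the t1..t2 data was written
min_acting_set_during_writes = {"1.0": 8, "1.1": 9}

def current_degradation(pg):
    # today's metric: only the acting set size right now matters
    return EXPECTED_SIZE - acting_set_size[pg]

def history_aware_degradation(pg):
    # what issue 1 argues for: account for how degraded the data
    # written in past intervals still is
    return EXPECTED_SIZE - min_acting_set_during_writes[pg]

for pg in ("1.0", "1.1"):
    print(pg, current_degradation(pg), history_aware_degradation(pg))

# current metric:       PG 1.0 -> 1, PG 1.1 -> 2  => PG 1.1 recovers first
# history-aware metric: PG 1.0 -> 3, PG 1.1 -> 2  => PG 1.0 should go first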