OSD cascading crash during recovery with corrupted replica

Hi Sam/David,
We have run into this problem a couple of times, and working around it operationally is extremely painful. I would like to work on a patch, but before I start it would be nice to hear your suggestions.

The problem is:
On an erasure-coded pool, when an object is corrupted and is also a recovery candidate, the primary currently crashes while trying to recover it. Each OSD in the acting set that is then promoted to primary crashes in turn, until the PG goes down.

Solution:
I think one way to fix it is to put the object back on the recovery waiting list together with the corruption information (i.e., add the corrupted shard to peering_missing), and let it be picked up by the next round of recovery.
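To make the idea concrete, here is a minimal self-contained sketch of the intended behavior. It does not use real Ceph types; `ObjectState`, `try_recover`, and the waiting queue are hypothetical stand-ins, with the corrupted-shard set playing the role of the peering_missing entry. The point is just that detecting corruption records the bad shard and requeues the object rather than aborting the OSD:

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <string>

// Hypothetical model of the proposed fix: on hitting a corrupted shard
// during EC recovery, record the bad shard (analogous to adding it to
// peering_missing) and requeue the object for the next recovery round,
// which can then reconstruct from the remaining clean shards.

struct ObjectState {
    std::string oid;
    std::set<int> shards;            // shards present in the acting set
    std::set<int> corrupted_shards;  // shards known to be bad
};

// Shards usable for reconstruction, excluding known-bad ones.
static std::set<int> usable_shards(const ObjectState& obj) {
    std::set<int> out;
    for (int s : obj.shards)
        if (!obj.corrupted_shards.count(s))
            out.insert(s);
    return out;
}

// One recovery attempt. If a not-yet-recorded corrupted shard is hit,
// mark it missing and put the object back on the waiting list instead
// of crashing; return whether this round could reconstruct the object
// (needs at least k clean shards for an EC k+m profile).
bool try_recover(ObjectState& obj, int bad_shard, int k,
                 std::queue<std::string>& waiting) {
    if (obj.shards.count(bad_shard) &&
        !obj.corrupted_shards.count(bad_shard)) {
        obj.corrupted_shards.insert(bad_shard);  // peering_missing analogue
        waiting.push(obj.oid);                   // retry next round
        return false;  // this round fails, but the OSD stays up
    }
    return static_cast<int>(usable_shards(obj).size()) >= k;
}
```

With a k=2 profile and shards {0, 1, 2} where shard 1 is corrupted, the first attempt fails and requeues the object; the next round skips the recorded bad shard and reconstructs from the two clean ones, instead of the cascade of primary crashes described above.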

Does that sound like a good approach to pursue? Do you have any other suggestions I should look into?

Thanks,
Guang


