OSD cascading crash during recovery with corrupted replica

Hi Sam/David,
We have run into this problem a couple of times, and working around it operationally is extremely painful. I would like to work on a patch, but before I start it would be nice to hear your suggestions.

The problem is:
On an erasure-coded pool, when an object is corrupted and is also a recovery candidate, the primary currently crashes while trying to recover it. Each OSD in the acting set that is then promoted to primary crashes in turn, until the PG goes down.

Solution:
I think one way to fix it is to put the object back on the recovery waiting list together with the corruption information (i.e., add the corrupted shard to peering_missing), and let it be picked up by the next round of recovery.
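To make the idea concrete, here is a minimal self-contained sketch of the intended behavior. It does not use real Ceph types; `ObjectState`, `try_recover`, and the waiting queue are hypothetical stand-ins, with the corrupted-shard set playing the role of the peering_missing entry. The point is just that detecting corruption records the bad shard and requeues the object rather than aborting the OSD:

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <string>

// Hypothetical model of the proposed fix: on hitting a corrupted shard
// during EC recovery, record the bad shard (analogous to adding it to
// peering_missing) and requeue the object for the next recovery round,
// which can then reconstruct from the remaining clean shards.

struct ObjectState {
    std::string oid;
    std::set<int> shards;            // shards present in the acting set
    std::set<int> corrupted_shards;  // shards known to be bad
};

// Shards usable for reconstruction, excluding known-bad ones.
static std::set<int> usable_shards(const ObjectState& obj) {
    std::set<int> out;
    for (int s : obj.shards)
        if (!obj.corrupted_shards.count(s))
            out.insert(s);
    return out;
}

// One recovery attempt. If a not-yet-recorded corrupted shard is hit,
// mark it missing and put the object back on the waiting list instead
// of crashing; return whether this round could reconstruct the object
// (needs at least k clean shards for an EC k+m profile).
bool try_recover(ObjectState& obj, int bad_shard, int k,
                 std::queue<std::string>& waiting) {
    if (obj.shards.count(bad_shard) &&
        !obj.corrupted_shards.count(bad_shard)) {
        obj.corrupted_shards.insert(bad_shard);  // peering_missing analogue
        waiting.push(obj.oid);                   // retry next round
        return false;  // this round fails, but the OSD stays up
    }
    return static_cast<int>(usable_shards(obj).size()) >= k;
}
```

With a k=2 profile and shards {0, 1, 2} where shard 1 is corrupted, the first attempt fails and requeues the object; the next round skips the recorded bad shard and reconstructs from the two clean ones, instead of the cascade of primary crashes described above.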

Does that sound like a good approach to pursue? Do you have any other suggestions I should look into?

Thanks,
Guang


