disk failure prediction

Sage Weil <sweil@xxxxxxxxxx> · Wed, 18 Feb 2015 15:20:36 -0800 (PST)

Interesting paper at FAST:

	https://www.usenix.org/system/files/conference/fast15/fast15-paper-ma.pdf

Short version: reallocated sector counts correlate with impending disk 
failures (this sounds like what Sandon has been telling us for ages). 
Preemptively replacing disks with impending failures reduced EMC's rate 
of triple failures by 80%, and looking at the joint failure probability 
within each RAID set reduced the failure rate by 98%.  We wouldn't see 
quite the same results since our "RAID sets" are effectively entire pools, 
but this seems like a strong case for adding SMART monitoring to the OSDs 
or to Calamari already and doing some preemptive disk replacement.
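
To make the idea concrete, here is a rough sketch of the kind of check a 
monitor could run on each disk. It assumes ATA-style `smartctl -A` output 
and reads the raw Reallocated_Sector_Ct value. The function names, the 
sample text, and especially the threshold are made up for illustration; 
the paper derives its own thresholds, and nothing like this exists in the 
OSD or Calamari code today.

```python
# Hypothetical threshold, for illustration only (not taken from the paper).
REALLOC_THRESHOLD = 100


def reallocated_sectors(smartctl_output):
    """Return the raw Reallocated_Sector_Ct from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # ATA attribute rows have ten columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            raw = fields[9]
            return int(raw) if raw.isdigit() else None
    return None


def should_replace(smartctl_output, threshold=REALLOC_THRESHOLD):
    """Flag a disk whose reallocated sector count meets the threshold."""
    count = reallocated_sectors(smartctl_output)
    return count is not None and count >= threshold


# Made-up sample resembling the attribute table printed by `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       312
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       41234
"""

if __name__ == "__main__":
    print(reallocated_sectors(SAMPLE), should_replace(SAMPLE))
```

A real version would have to handle SAS/NVMe output, which is laid out 
differently, and would probably want per-model thresholds rather than one 
global number.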

sage