On Tue, 1 Feb 2011 17:20:34 +0800
Henry Chang <henry.cy.chang@xxxxxxxxx> wrote:

> Yeah. I expect that scrub can both detect disk errors and check data
> integrity (based on the checksum) in the background. For disk errors,
> I would like CEPH to mark the OSD down/failed and notify the sys
> admin immediately. For data errors, I expect that CEPH can repair
> them automatically (by fetching a right copy from other replicas).

I suppose the best approach would be to make this configurable with
per-OSD granularity, with something like an io_error_threshold config
variable. I would set it to something like 50 or 100, but you could set
it as low as 1; the OSD would then put itself down or out once that many
IO errors had propagated up to the OSD daemon.

I guess that even if that OSD becomes unresponsive for a while, it won't
be much trouble: Ceph will mark it down and it should recover later, or
else the OSD will soon take itself out once it hits the error threshold.

What do you think?

Cheers

Cláudio
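
To make the proposed threshold concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: io_error_threshold is the option suggested in this message, not an existing Ceph setting, and the class and method names (OsdErrorTracker, record_io_error) are hypothetical and do not correspond to any Ceph code.

```python
from dataclasses import dataclass


@dataclass
class OsdErrorTracker:
    """Hypothetical per-OSD error counter; not part of Ceph."""

    io_error_threshold: int = 50  # proposed per-OSD setting (assumed name)
    io_errors: int = 0
    state: str = "up"

    def record_io_error(self) -> str:
        """Count one IO error that reached the OSD daemon; return the state."""
        if self.state == "up":
            self.io_errors += 1
            if self.io_errors >= self.io_error_threshold:
                # Threshold reached: the OSD takes itself out of service.
                self.state = "down"
        return self.state


# Per-OSD granularity: each OSD carries its own threshold.
strict = OsdErrorTracker(io_error_threshold=1)
tolerant = OsdErrorTracker(io_error_threshold=50)

print(strict.record_io_error())    # down
print(tolerant.record_io_error())  # up
```

With a threshold of 1 the first error takes the OSD down, while a threshold of 50 tolerates occasional errors before giving up. A real daemon would presumably also notify the monitors and the administrator at that point, which this sketch leaves out.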