On 12/27/2013 08:15 PM, Justin Erenkrantz wrote:
> On Thu, Dec 26, 2013 at 9:17 PM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>> I think the question comes down to whether Ceph should take some internal
>> action based on the information, or whether that is better handled by some
>> external monitoring agent. For example, an external agent might collect
>> SMART info into graphite, and every so often do some predictive analysis
>> and mark out disks that are expected to fail soon.
>>
>> I'd love to see some consensus form around what this should look like...
>
> My $.02 from the peanut gallery: at a minimum, set the HEALTH_WARN flag if
> there is a SMART failure on a physical drive that contains an OSD. Yes,
> you could build the monitoring into a separate system, but I think it'd be
> really useful to combine it into the cluster health assessment. -- justin
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Hi,

Judging from my personal experience, SMART failures can be dangerous when
they are not bad enough to take an OSD down completely: the OSD does not
flap and is not marked down in time, yet cluster performance is greatly
affected.

I don't think SMART monitoring needs to be tied to Ceph, because separate
monitoring of predictive failure counters does that job well. In case of
sudden errors, a SMART query may not work at all, since the system may have
issued many bus resets and the disk may be completely inaccessible.

So I propose a set of two strategies: run regular, scattered background
checks, and monitor OSD responsiveness to work around cases where
performance degrades due to read/write errors.
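
To make the two strategies a bit more concrete, here is a rough sketch in
Python. It assumes some external agent already collects per-OSD latency
samples; the function names, the 3x threshold, the minimum sample count and
the sample data are all made up for illustration and are not Ceph APIs.

```python
from statistics import median


def scattered_offsets(osd_ids, period_s):
    """Spread one background check per OSD evenly across period_s seconds."""
    ids = sorted(osd_ids)
    if not ids:
        return {}
    step = period_s / len(ids)
    return {osd: i * step for i, osd in enumerate(ids)}


def slow_osds(latencies_ms, factor=3.0, min_samples=5):
    """Return ids of OSDs whose median latency exceeds factor x the peer median.

    latencies_ms maps an OSD id to a list of recent latency samples (ms).
    OSDs with fewer than min_samples samples are skipped. The threshold
    (factor) is an illustrative assumption, not a recommended value.
    """
    medians = {
        osd: median(samples)
        for osd, samples in latencies_ms.items()
        if len(samples) >= min_samples
    }
    if len(medians) < 2:
        return []  # nothing to compare against
    baseline = median(medians.values())
    return sorted(osd for osd, m in medians.items() if m > factor * baseline)


if __name__ == "__main__":
    samples = {
        0: [4.0, 5.0, 4.5, 5.5, 4.8],
        1: [5.1, 4.9, 5.3, 5.0, 4.7],
        2: [60.0, 75.0, 58.0, 90.0, 66.0],  # still up, but struggling
    }
    print(slow_osds(samples))
    print(scattered_offsets([0, 1, 2], 3600))
```

Comparing each OSD against the median of its peers, rather than against a
fixed limit, is only one possible choice; it is meant to catch a disk that
stays up but drags the cluster down, which is the case SMART queries may
miss.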