Hi Patrick,

We've just discussed this internally and I wanted to share some notes.

First, there are at least three separate efforts in our IT dept to collect and analyse SMART data -- it's clearly a popular idea and simple to implement, but this leads to repetition and begs for a common, good solution.

One (perhaps trivial) issue is that it is hard to define exactly when a drive has failed -- it varies depending on the storage system. For Ceph I would define failure as EIO, which normally correlates with a drive medium error, but there were other ideas here. So if this should be a general-purpose service, the sensor should have a pluggable failure indicator.

There was also debate about what exactly we could do with a failure prediction model. Suppose the predictor told us a drive should fail in one week. We could proactively drain that disk, but then would it still fail? Would the vendor replace that drive under warranty if it was only *about to fail*?

Lastly, and more importantly, there is a general hesitation to publish this kind of data openly, given how negatively it could impact a manufacturer. Our lab certainly couldn't publish a report saying "here are the most and least reliable drives". I don't know if anonymising the data sources would help here, but anyway I'm curious what your thoughts are on that point. Maybe what can come out of this are the _components_ of a drive reliability service, which could then be deployed privately or publicly as appropriate.

Thanks!

Dan

On Wed, May 24, 2017 at 8:57 PM, Patrick McGarry <pmcgarry@xxxxxxxxxx> wrote:
> Hey cephers,
>
> Just wanted to share the genesis of a new community project that could
> use a few helping hands (and any amount of feedback/discussion that
> you might like to offer).
>
> As a bit of backstory, around 2013 the Backblaze folks started
> publishing statistics about hard drive reliability from within their
> data center for the world to consume.
> This included things like model,
> make, failure state, and SMART data. If you would like to view the
> Backblaze data set, you can find it at:
>
> https://www.backblaze.com/b2/hard-drive-test-data.html
>
> While most major cloud providers are doing this for themselves
> internally, we would like to replicate/enhance this effort across a
> much wider segment of the population as a free service. I think we
> have a pretty good handle on the server/platform side of things, and a
> couple of people who have expressed interest in building the
> reliability model (although we could always use more!), what we really
> need is a passionate volunteer who would like to come forward to write
> the agent that sits on the drives, aggregates data, and submits daily
> stats reports via an API (and potentially receives information back as
> results are calculated about MTTF or potential to fail in the next
> 24-48 hrs).
>
> Currently my thinking is to build our collection method based on the
> Backblaze data set so that we can use it to train our model and build
> from going forward. If this sounds like a project you would like to be
> involved in (especially if you're from Backblaze!) please let me know.
> I think a first pass of the agent should be something we can build in
> a couple of afternoons to start testing with a small pilot group that
> we already have available.
>
> Happy to entertain any thoughts or feedback that people might have. Thanks!
>
> --
>
> Best Regards,
>
> Patrick McGarry
> Director Ceph Community || Red Hat
> http://ceph.com || http://community.redhat.com
> @scuttlemonkey || @ceph
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com