Hi Patrick:

I want to ask a very small question. How many 9s do you claim for your
storage durability, and how is it calculated? Based on the data you
provided, have you found a failure model to refine the storage durability
estimate?

On Thu, Jun 15, 2017 at 12:09 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
> I understand concern over annoying drive manufacturers, but if you have data
> to back it up you aren't slandering a drive manufacturer. If they don't
> like the numbers that are found, then they should up their game or at least
> request that you put in how your tests negatively affected their drive
> endurance. For instance, WD Red drives are out of warranty just by being
> placed in a chassis with more than 4 disks because they aren't rated for the
> increased vibration from that many disks in a chassis.
>
> OTOH, if you are testing the drives within the bounds of the drive's
> warranty, and not doing anything against the recommendation of the
> manufacturer in the test use case (both physical and software), then there
> is no slander when you say that drive A outperformed drive B. I know that
> the drives I run at home are not nearly as resilient as the drives that I
> use at the office, but I don't put my home cluster through a fraction of the
> strain that I do at the office. The manufacturer knows that their cheaper
> drive isn't as resilient as the more robust enterprise drives. Anyway, I'm
> sure you guys have thought about all of that and are _generally_ pretty
> smart. ;)
>
> In an early warning system that detects a drive that is close to failing,
> you could implement a command to migrate off of the disk and then run
> non-stop IO on it to finish off the disk to satisfy warranties. Potentially
> this could be implemented with the osd daemon via a burn-in start-up option.
> Where it can be an OSD in the cluster that does not check in as up, but with
> a different status so you can still monitor the health of the failing drive
> from a ceph status.
> This could also be useful for people that would like to
> burn-in their drives, but don't want to dedicate infrastructure to
> burning-in new disks before deploying them. Making this as easy as possible
> on the end user/ceph admin, there could even be a ceph.conf option for OSDs
> that are added to the cluster and have never been marked in to run
> through a burn-in of X seconds (changeable in the config and defaults to 0
> so as not to change the default behavior). I don't know if this is
> over-thinking it or adding complexity where it shouldn't be, but it could be
> used to get a drive to fail to use for an RMA. OTOH, for large deployments
> we would RMA drives in batches and were never asked to prove that the drive
> failed. We would RMA drives off of medium errors for HDDs and SMART info
> for SSDs and of course for full failures.
>
> On Wed, Jun 14, 2017 at 11:38 AM Dan van der Ster <dan@xxxxxxxxxxxxxx>
> wrote:
>>
>> Hi Patrick,
>>
>> We've just discussed this internally and I wanted to share some notes.
>>
>> First, there are at least three separate efforts in our IT dept to
>> collect and analyse SMART data -- it's clearly a popular idea and
>> simple to implement, but this leads to repetition and begs for a
>> common, good solution.
>>
>> One (perhaps trivial) issue is that it is hard to define exactly when
>> a drive has failed -- it varies depending on the storage system. For
>> Ceph I would define failure as EIO, which normally correlates with a
>> drive medium error, but there were other ideas here. So if this should
>> be a general purpose service, the sensor should have a pluggable
>> failure indicator.
>>
>> There was also debate about what exactly we could do with a failure
>> prediction model. Suppose the predictor told us a drive should fail in
>> one week. We could proactively drain that disk, but then would it
>> still fail? Will the vendor replace that drive under warranty only if
>> it was *about to fail*?
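
To make the "pluggable failure indicator" idea above a bit more concrete,
here is a rough Python sketch. Everything in it is an assumption for the
sake of illustration: the DriveSample fields, the Sensor class, the SMART
attribute name and the threshold are all hypothetical, not part of any
existing Ceph or agent API. It only shows how the definition of "failed"
could be swapped per storage system, with EIO as the Ceph-style choice.

```python
import errno
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DriveSample:
    """One observation of a drive. All field names are hypothetical."""
    serial: str
    io_errnos: List[int] = field(default_factory=list)   # errno values seen since the last sample
    smart: Dict[str, int] = field(default_factory=dict)  # raw SMART attributes keyed by name


# A failure indicator is any callable that maps a sample to True (failed) or False.
FailureIndicator = Callable[[DriveSample], bool]


def eio_indicator(sample: DriveSample) -> bool:
    """Ceph-style definition from the thread: the drive counts as failed once it returns EIO."""
    return errno.EIO in sample.io_errnos


def reallocated_sectors_indicator(threshold: int) -> FailureIndicator:
    """A different, made-up definition: too many reallocated sectors (threshold is arbitrary)."""
    def indicator(sample: DriveSample) -> bool:
        return sample.smart.get("reallocated_sector_ct", 0) >= threshold
    return indicator


class Sensor:
    """Applies whichever failure indicator it was configured with."""

    def __init__(self, indicator: FailureIndicator) -> None:
        self.indicator = indicator

    def failed_serials(self, samples: List[DriveSample]) -> List[str]:
        # Return the serial numbers of drives the configured indicator flags as failed.
        return [s.serial for s in samples if self.indicator(s)]
```

A Ceph deployment would build the sensor as Sensor(eio_indicator), while
another storage system could pass its own callable without touching the
collection code.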
>>
>> Lastly, and more importantly, there is a general hesitation to publish
>> this kind of data openly, given how negatively it could impact a
>> manufacturer. Our lab certainly couldn't publish a report saying "here
>> are the most and least reliable drives". I don't know if anonymising
>> the data sources would help here, but anyway I'm curious what are your
>> thoughts on that point. Maybe what can come out of this are the
>> _components_ of a drive reliability service, which could then be
>> deployed privately or publicly as appropriate.
>>
>> Thanks!
>>
>> Dan
>>
>> On Wed, May 24, 2017 at 8:57 PM, Patrick McGarry <pmcgarry@xxxxxxxxxx>
>> wrote:
>> > Hey cephers,
>> >
>> > Just wanted to share the genesis of a new community project that could
>> > use a few helping hands (and any amount of feedback/discussion that
>> > you might like to offer).
>> >
>> > As a bit of backstory, around 2013 the Backblaze folks started
>> > publishing statistics about hard drive reliability from within their
>> > data center for the world to consume. This included things like model,
>> > make, failure state, and SMART data. If you would like to view the
>> > Backblaze data set, you can find it at:
>> >
>> > https://www.backblaze.com/b2/hard-drive-test-data.html
>> >
>> > While most major cloud providers are doing this for themselves
>> > internally, we would like to replicate/enhance this effort across a
>> > much wider segment of the population as a free service.
>> > I think we
>> > have a pretty good handle on the server/platform side of things, and a
>> > couple of people who have expressed interest in building the
>> > reliability model (although we could always use more!). What we really
>> > need is a passionate volunteer who would like to come forward to write
>> > the agent that sits on the drives, aggregates data, and submits daily
>> > stats reports via an API (and potentially receives information back as
>> > results are calculated about MTTF or potential to fail in the next
>> > 24-48 hrs).
>> >
>> > Currently my thinking is to build our collection method based on the
>> > Backblaze data set so that we can use it to train our model and build
>> > from going forward. If this sounds like a project you would like to be
>> > involved in (especially if you're from Backblaze!) please let me know.
>> > I think a first pass of the agent should be something we can build in
>> > a couple of afternoons to start testing with a small pilot group that
>> > we already have available.
>> >
>> > Happy to entertain any thoughts or feedback that people might have.
>> > Thanks!
>> >
>> > --
>> >
>> > Best Regards,
>> >
>> > Patrick McGarry
>> > Director Ceph Community || Red Hat
>> > http://ceph.com || http://community.redhat.com
>> > @scuttlemonkey || @ceph
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com