RE: [ceph-users] Help build a drive reliability service!

The question is tiny, the answer is Yuge ;-)

Ceph by itself can't quote a specific durability. The actual durability is a combination of the HW that you use, the failure scenarios that you're looking at, and the specific configuration of Ceph. Ceph provides a toolkit that allows you to overcome the durability limits of a specific piece of HW and synthesize a system-level durability that's much (or more often, much much much much much) better.

To get a true system-level durability number, you have to combine all of the different failure mode probabilities into an aggregate number. Usually failure modes are modeled as uncorrelated events, which makes the math simple and is accurate enough for most purposes.
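As a rough sketch of that aggregation step, assuming the failure modes really are uncorrelated: the chance of losing data is one minus the chance of surviving every mode. The function names and the per-mode probabilities below are made-up placeholders for illustration, not measured values.

```python
import math

def combined_loss_probability(mode_probabilities):
    """Probability that at least one independent failure mode causes data loss."""
    survive = 1.0
    for p in mode_probabilities:
        survive *= (1.0 - p)  # must survive every mode
    return 1.0 - survive

def nines(loss_probability):
    """Express a loss probability as a count of 'nines' of durability."""
    return -math.log10(loss_probability)

# Hypothetical annual loss probabilities per failure-mode class (placeholders).
modes = {"cluster-level": 1e-6, "drive-level": 1e-5, "sector-level": 1e-7}

total = combined_loss_probability(modes.values())
print(f"aggregate annual loss probability: {total:.3e}")
print(f"roughly {nines(total):.1f} nines of durability")
```

Note that the aggregate is dominated by the worst mode, so that is usually where the effort should go.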

There are LOTS of failure modes (cluster-level, drive-level and sector-level failure modes all have scenarios that lead to data loss and hence impact system-level durability). But this thread is focused on drive-level events, so we'll confine ourselves to those.

For a simple case like 2x replication (i.e., you have two copies of the data lying around, as in RAID-1), you're looking at the case where you get a first drive failure and then a second drive failure BEFORE you've had a chance to rebuild/recover from the first. This means that you actually have two input variables to the computation: the drive failure rate (typically quoted as AFR, the annualized failure rate, i.e., the percentage of drives expected to fail within a year) AND the recovery time period. However, this is the per-drive durability and you wanted the cluster-level durability, so you have to scale it up by the total number of drives in the system (since ANY drive failure ANYWHERE in the system is presumably a cluster-level durability failure and the events are uncorrelated).

The durability then becomes: "What are the odds that I'll have a second drive failure WHILE I'm still rebuilding the first drive, TIMES the number of drives?", which is simply AFR * rebuild time * number of drives (with suitable unit conversions, of course).
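To make the unit conversion concrete, here is a minimal sketch of the AFR * rebuild time * number of drives product. The function name and the example inputs (2% AFR, 24-hour rebuild, 100 drives) are illustrative assumptions, and the linear product is only a reasonable approximation while the result is much smaller than 1.

```python
HOURS_PER_YEAR = 24 * 365

def second_failure_odds(afr, rebuild_hours, num_drives):
    """AFR * rebuild time * number of drives, with the rebuild time in years.

    afr is a fraction per drive-year (0.02 means 2%), so the rebuild window
    has to be converted from hours to years before multiplying.
    """
    rebuild_years = rebuild_hours / HOURS_PER_YEAR
    return afr * rebuild_years * num_drives

# Assumed example inputs, not measurements.
odds = second_failure_odds(afr=0.02, rebuild_hours=24, num_drives=100)
print(f"odds of a second failure during the rebuild: {odds:.5f}")
```

Halving the rebuild time halves the result, which is why the rebuild window matters as much as the drive quality.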

One warning: AFR isn't a constant number ;-). All drives (SSD or HDD) are subject to wear-out. In a long-running cluster you will typically have a population of drives of varying age, and you might need to factor that into your equations based on your expected expansion, tech refresh, drive retirement policies, etc.

Rebuild time can be tricky. First, you have to include the time from when the first drive fails until you actually start the rebuild (is this a manually initiated process? How long before somebody actually swaps the drive and pushes the button to start, or do you have hot standbys?). Then you have to factor in the amount of data to be rebuilt (Ceph only rebuilds 'live' data, not the whole drive, so if your cluster is 50% full you benefit from only rebuilding 1/2 of the drive). Finally, you have to figure in the rebuild rate. The last item is often a problem, as the greater the rebuild rate, the less performance is available for normal operations. Essentially, you have to overprovision your cluster's performance level to be able to perform rebuilds at a reasonable rate [in the extreme, imagine if it took a YEAR to rebuild a drive...].
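The three ingredients above (delay before the rebuild starts, amount of live data, rebuild rate) can be combined in a back-of-the-envelope estimate like the one below. The function name, parameter names and example numbers are assumptions for illustration only; a real cluster's recovery throughput depends on its configuration and load.

```python
def rebuild_window_hours(delay_hours, drive_tb, fill_fraction, rebuild_mb_per_s):
    """Estimate the window of vulnerability after a drive failure, in hours.

    delay_hours      -- time from the failure until the rebuild actually starts
    drive_tb         -- drive capacity in TB (decimal, 1 TB = 1,000,000 MB)
    fill_fraction    -- fraction of the drive holding live data (0.5 = 50% full)
    rebuild_mb_per_s -- sustained rebuild rate you can afford to spend
    """
    live_mb = drive_tb * 1_000_000 * fill_fraction  # only live data is rebuilt
    copy_hours = live_mb / rebuild_mb_per_s / 3600
    return delay_hours + copy_hours

# Assumed example: 8 TB drive, 50% full, 100 MB/s rebuild, 1 hour to react.
window = rebuild_window_hours(delay_hours=1, drive_tb=8,
                              fill_fraction=0.5, rebuild_mb_per_s=100)
print(f"window of vulnerability: {window:.1f} hours")
```

The result feeds straight into the rebuild time term of the durability estimate, so a slow manual swap can easily dominate the actual copy time.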

Triple replication or +2 erasure coding have essentially the same math (with potentially different rebuild rates :-)): what's the probability that you'll have three drive failures in the window of vulnerability, which is a function of the rebuild time?
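One way to sketch the multi-failure case is a Poisson approximation for the number of further drive failures inside the window, again treating failures as uncorrelated. This is my simplification, not a Ceph-provided formula: it ignores that each new failure extends the window and that not every combination of drives actually shares data. The names and inputs are illustrative assumptions.

```python
import math

HOURS_PER_YEAR = 24 * 365

def prob_at_least_k_failures(k, afr, window_hours, num_drives):
    """P(at least k of num_drives fail within the window), Poisson approximation."""
    expected = afr * (window_hours / HOURS_PER_YEAR) * num_drives
    # Sum P(0), P(1), ..., P(k-1) failures and take the complement.
    p_fewer = sum(math.exp(-expected) * expected**i / math.factorial(i)
                  for i in range(k))
    return 1.0 - p_fewer

# After a first failure: 2x replication is exposed to 1 further failure,
# 3x replication (or +2 erasure coding) to 2 further failures.
for label, k in (("2x replication", 1), ("3x replication / +2 EC", 2)):
    p = prob_at_least_k_failures(k, afr=0.02, window_hours=24, num_drives=100)
    print(f"{label}: {p:.2e}")
```

With these assumed inputs the extra copy buys a couple of orders of magnitude, which is the "how big is your wallet" trade-off in numbers.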

In short, by overprovisioning on performance and raw capacity (replication/erasure coding) you can achieve arbitrarily high levels of insurance (durability) against this failure mode. It's a function of how big your wallet is....







Allen Samuels  
R&D Engineering Fellow 

Western Digital® 
Email:  allen.samuels@xxxxxxx 
Office:  +1-408-801-7030
Mobile: +1-408-780-6416 

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Z Will
Sent: Thursday, June 15, 2017 11:48 PM
To: Patrick McGarry <pmcgarry@xxxxxxxxxx>
Cc: Dan van der Ster <dan@xxxxxxxxxxxxxx>; Ceph Devel <ceph-devel@xxxxxxxxxxxxxxx>; Ceph-User <ceph-users@xxxxxxxx>; David Turner <drakonstein@xxxxxxxxx>
Subject: Re: [ceph-users] Help build a drive reliability service!

Hi Patrick:
    I want to ask a very tiny question. How many 9s do you claim for your storage durability? And how is it calculated? Based on the data you provided, have you found a failure model to refine the storage durability?

On Thu, Jun 15, 2017 at 12:09 AM, David Turner <drakonstein@xxxxxxxxx> wrote:
> I understand concern over annoying drive manufacturers, but if you 
> have data to back it up you aren't slandering a drive manufacturer.  
> If they don't like the numbers that are found, then they should up 
> their game or at least request that you put in how your tests 
> negatively affected their drive endurance.  For instance, WD Red 
> drives are out of warranty just by being placed in a chassis with more 
> than 4 disks because they aren't rated for the increased vibration from that many disks in a chassis.
>
> OTOH, if you are testing the drives within the bounds of the drives 
> warranty, and not doing anything against the recommendation of the 
> manufacturer in the test use case (both physical and software), then 
> there is no slander when you say that drive A outperformed drive B.  I 
> know that the drives I run at home are not nearly as resilient as the 
> drives that I use at the office, but I don't put my home cluster 
> through a fraction of the strain that I do at the office.  The 
> manufacturer knows that their cheaper drive isn't as resilient as the 
> more robust enterprise drives.  Anyway, I'm sure you guys have thought 
> about all of that and are _generally_ pretty smart. ;)
>
> In an early warning system that detects a drive that is close to 
> failing, you could implement a command to migrate off of the disk and 
> then run non-stop IO on it to finish off the disk to satisfy 
> warranties.  Potentially this could be implemented with the osd daemon via a burn-in start-up option.
> Where it can be an OSD in the cluster that does not check in as up, 
> but with a different status so you can still monitor the health of the 
> failing drive from a ceph status.  This could also be useful for 
> people that would like to burn-in their drives, but don't want to 
> dedicate infrastructure to burning-in new disks before deploying them.  
> Making this as easy as possible on the end user/ceph admin, there 
> could even be a ceph.conf option for OSDs that are added to the 
> cluster and have never been marked in to run through a burn-in of 
> X seconds (changeable in the config and defaults to 0 as to not change 
> the default behavior).  I don't know if this is over-thinking it or 
> adding complexity where it shouldn't be, but it could be used to get a 
> drive to fail to use for an RMA.  OTOH, for large deployments we would 
> RMA drives in batches and were never asked to prove that the drive 
> failed.  We would RMA drives off of medium errors for HDDs and smart info for SSDs and of course for full failures.
>
> On Wed, Jun 14, 2017 at 11:38 AM Dan van der Ster <dan@xxxxxxxxxxxxxx>
> wrote:
>>
>> Hi Patrick,
>>
>> We've just discussed this internally and I wanted to share some notes.
>>
>> First, there are at least three separate efforts in our IT dept to 
>> collect and analyse SMART data -- it's clearly a popular idea and 
>> simple to implement, but this leads to repetition and begs for a 
>> common, good solution.
>>
>> One (perhaps trivial) issue is that it is hard to define exactly when 
>> a drive has failed -- it varies depending on the storage system. For 
>> Ceph I would define failure as EIO, which normally correlates with a 
>> drive medium error, but there were other ideas here. So if this 
>> should be a general purpose service, the sensor should have a 
>> pluggable failure indicator.
>>
>> There was also debate about what exactly we could do with a failure 
>> prediction model. Suppose the predictor told us a drive should fail 
>> in one week. We could proactively drain that disk, but then would it 
>> still fail? Will the vendor replace that drive under warranty only if 
>> it was *about to fail*?
>>
>> Lastly, and more importantly, there is a general hesitation to 
>> publish this kind of data openly, given how negatively it could 
>> impact a manufacturer. Our lab certainly couldn't publish a report 
>> saying "here are the most and least reliable drives". I don't know if 
>> anonymising the data sources would help here, but anyway I'm curious 
>> what are your thoughts on that point. Maybe what can come out of this 
>> are the _components_ of a drive reliability service, which could then 
>> be deployed privately or publicly as appropriate.
>>
>> Thanks!
>>
>> Dan
>>
>>
>>
>>
>> On Wed, May 24, 2017 at 8:57 PM, Patrick McGarry 
>> <pmcgarry@xxxxxxxxxx>
>> wrote:
>> > Hey cephers,
>> >
>> > Just wanted to share the genesis of a new community project that 
>> > could use a few helping hands (and any amount of 
>> > feedback/discussion that you might like to offer).
>> >
>> > As a bit of backstory, around 2013 the Backblaze folks started 
>> > publishing statistics about hard drive reliability from within 
>> > their data center for the world to consume. This included things 
>> > like model, make, failure state, and SMART data. If you would like 
>> > to view the Backblaze data set, you can find it at:
>> >
>> > https://www.backblaze.com/b2/hard-drive-test-data.html
>> >
>> > While most major cloud providers are doing this for themselves 
>> > internally, we would like to replicate/enhance this effort across a 
>> > much wider segment of the population as a free service.  I think we 
>> > have a pretty good handle on the server/platform side of things, 
>> > and a couple of people who have expressed interest in building the 
>> > reliability model (although we could always use more!), what we 
>> > really need is a passionate volunteer who would like to come 
>> > forward to write the agent that sits on the drives, aggregates 
>> > data, and submits daily stats reports via an API (and potentially 
>> > receives information back as results are calculated about MTTF or 
>> > potential to fail in the next
>> > 24-48 hrs).
>> >
>> > Currently my thinking is to build our collection method based on 
>> > the Backblaze data set so that we can use it to train our model and 
>> > build from going forward. If this sounds like a project you would 
>> > like to be involved in (especially if you're from Backblaze!) please let me know.
>> > I think a first pass of the agent should be something we can build 
>> > in a couple of afternoons to start testing with a small pilot group 
>> > that we already have available.
>> >
>> > Happy to entertain any thoughts or feedback that people might have.
>> > Thanks!
>> >
>> > --
>> >
>> > Best Regards,
>> >
>> > Patrick McGarry
>> > Director Ceph Community || Red Hat
>> > http://ceph.com  ||  http://community.redhat.com
>> > @scuttlemonkey || @ceph
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at  http://vger.kernel.org/majordomo-info.html