Re: Failure probability with largish deployments

> Yes, that also makes perfect sense: the aforementioned 12500 objects
> for a 50GB image, and on a 60TB cluster/pool with 72 disks/OSDs and
> 3-way replication that makes 2400 PGs, following the recommended formula.
>
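
(As a sanity check, that 2400 is just the usual placement-group rule of
thumb applied to the quoted numbers; a short Python sketch, with purely
illustrative variable names:)

    osds = 72                      # disks/OSDs in the pool
    replicas = 3                   # 3-way replication
    pgs = osds * 100 // replicas   # recommended formula: (OSDs * 100) / replicas
    print(pgs)                     # -> 2400
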
>> > What amount of disks (OSDs) did you punch in for the following run?
>> >> Disk Modeling Parameters
>> >>     size:           3TiB
>> >>     FIT rate:        826 (MTBF = 138.1 years)
>> >>     NRE rate:    1.0E-16
>> >> RADOS parameters
>> >>     auto mark-out:     10 minutes
>> >>     recovery rate:    50MiB/s (40 seconds/drive)
>> > Blink???
>> > I guess that goes back to the number of disks, but to restore 2.25GB at
>> > 50MB/s with 40 seconds per drive...
>>
>> The surviving replicas for placement groups that the failed OSD
>> participated in will naturally be distributed across many OSDs in the
>> cluster. When the failed OSD is marked out, its replicas will be
>> remapped to many OSDs. It's not a 1:1 replacement like you might find
>> in a RAID array.
>>
> I completely get that part; however, the total amount of data to be
> rebalanced after a single disk/OSD failure to fully restore redundancy is
> still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
> assumed.
> What I'm still missing in this picture is how many disks (OSDs) you
> calculated this with. Maybe I'm just misreading the 40 seconds per drive
> bit there, because if that means each drive only needs to be active for
> 40 seconds to do its bit of the recovery, we're talking 1100 drives. ^o^
> 1100 PGs would be another story.

To recreate the modeling:

git clone https://github.com/ceph/ceph-tools.git
cd ceph-tools/models/reliability/
python main.py -g

I used the following values:

Disk Type: Enterprise
Size: 3000 GiB
Primary FITs: 826
Secondary FITS: 826
NRE Rate: 1.0E-16

RAID Type: RAID6
Replace (hours): 6
Rebuild (MiB/s): 500
Volumes: 11

RADOS Copies: 3
Mark-out (min): 10
Recovery (MiB/s): 50
Space Usage: 75%
Declustering (pg): 1100
Stripe length: 1100 (limited by pgs anyway)

RADOS sites: 1
Rep Latency (s): 0
Recovery (MiB/s): 10
Disaster (years): 1000
Site Recovery (days): 30

NRE Model: Fail
Period (years): 1
Object Size: 4MB

It seems that the number of disks is not considered when calculating
the recovery window; only the number of PGs is:

https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L68
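
To make the arithmetic concrete, here is a rough back-of-the-envelope
sketch in Python of how I read that calculation (my own variable names
and structure, not the model's actual code):

    MiB = 1024 * 1024
    GiB = 1024 * MiB

    disk_size   = 3000 * GiB   # per-OSD capacity from the run above
    space_usage = 0.75         # 75% full
    pgs         = 1100         # declustering factor
    rate        = 50 * MiB     # per-OSD recovery rate, bytes/s

    # Data that must be re-replicated after one OSD failure:
    data = disk_size * space_usage   # ~2.25 TiB

    # The model spreads that over 'pgs' concurrent recovery streams,
    # regardless of how many OSDs actually exist in the cluster:
    window = data / (pgs * rate)     # seconds
    print(round(window))             # -> 42, i.e. roughly the "40 seconds/drive" above

Capping that parallelism at min(pgs, surviving OSDs) is presumably what
taking the disk count into account would look like.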

I could also see the recovery rate varying based on the osd max
backfills tunable:

http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling
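
For reference, that tunable is set per OSD in ceph.conf; something like
the following (the value shown is purely illustrative, not a
recommendation):

    [osd]
        osd max backfills = 10    # concurrent backfills allowed to/from one OSD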

Accounting for both would improve the quality of the models the tool generates.

-- 

Kyle
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



