> Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> actual OSD FS?

Objects are at the RADOS level. CephFS filesystems, RBD images and RGW
objects are all built by striping their data across RADOS objects - the
default object size is 4MB.

> In my case, I'm only looking at RBD images for KVM volume storage, even
> given the default striping configuration I would assume that those 12500
> OSD objects for a 50GB image would not be in the same PG and thus just on
> 3 (with 3 replicas set) OSDs total?

Objects are striped across placement groups, so take your RBD size divided
by 4MB and cap it at the total number of placement groups in your cluster.

> What amount of disks (OSDs) did you punch in for the following run?
>
>> Disk Modeling Parameters
>>     size:          3TiB
>>     FIT rate:      826 (MTBF = 138.1 years)
>>     NRE rate:      1.0E-16
>> RADOS parameters
>>     auto mark-out: 10 minutes
>>     recovery rate: 50MiB/s (40 seconds/drive)
>
> Blink???
>
> I guess that goes back to the number of disks, but to restore 2.25GB at
> 50MB/s with 40 seconds per drive...

The surviving replicas for the placement groups that the failed OSD
participated in will naturally be distributed across many OSDs in the
cluster; when the failed OSD is marked out, its replicas will be remapped
to many OSDs. It's not a 1:1 replacement like you might find in a RAID
array.

>>     osd fullness:  75%
>>     declustering:  1100 PG/OSD
>>     NRE model:     fail
>>     object size:   4MB
>>     stripe length: 1100
>
> I take it that is to mean that any RBD volume of sufficient size is indeed
> spread over all disks?

Spread over all placement groups - the difference is subtle, but there is
a difference.

-- 
Kyle
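To make the striping arithmetic above concrete, here is a minimal Python
sketch. The 50GB image and 4MB object size come from the thread; the pool's
PG count (2048) and replica count (3) are assumed example values, not
figures from the thread.

```python
# Minimal sketch of how many RADOS objects back an RBD image, and why the
# image ends up spread across (nearly) all placement groups rather than
# sitting on just 3 OSDs.
# Assumption: pg_count=2048 and replicas=3 are illustrative only.

GiB = 1024 ** 3
MiB = 1024 ** 2

image_size  = 50 * GiB    # RBD image size from the thread
object_size = 4 * MiB     # default RADOS object size for RBD
pg_count    = 2048        # assumed pool pg_num (not from the thread)
replicas    = 3           # assumed pool size / replica count

rados_objects = image_size // object_size     # 12800 (the thread's ~12500 uses decimal GB/MB)
pgs_touched   = min(rados_objects, pg_count)  # capped at the pool's placement group count

print(f"RADOS objects in the image: {rados_objects}")
print(f"Placement groups touched:   ~{pgs_touched}")
print(f"Replica placements:         ~{pgs_touched * replicas}, spread over many OSDs")
```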
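A similar back-of-the-envelope sketch for the "40 seconds/drive" recovery
figure, using only the modeling parameters quoted above (3TiB drives, 75%
full, 1100 PG/OSD declustering, 50MiB/s recovery rate). The simulator's
exact model may differ, so treat this as an approximation of why parallel,
declustered recovery is so much faster than a 1:1 RAID rebuild.

```python
# Rough sketch of why the model reports ~40 seconds of recovery work per
# drive: the failed OSD's data is re-replicated by its ~1100 peers in
# parallel, so each peer only handles a small slice.

TiB = 1024 ** 4
MiB = 1024 ** 2

drive_size    = 3 * TiB     # disk size from the quoted parameters
fullness      = 0.75        # osd fullness
declustering  = 1100        # PG/OSD, roughly the number of recovery peers
recovery_rate = 50 * MiB    # per-OSD recovery rate, bytes/s

data_to_recover  = drive_size * fullness            # ~2.25 TiB
per_peer_data    = data_to_recover / declustering   # ~2.1 GiB per peer
per_peer_seconds = per_peer_data / recovery_rate    # ~43 s, i.e. "40 seconds/drive"

print(f"Data to re-replicate: {data_to_recover / TiB:.2f} TiB")
print(f"Per-peer share:       {per_peer_data / MiB:.0f} MiB")
print(f"Per-peer time:        {per_peer_seconds:.0f} s")
```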