Re: Failure probability with largish deployments

Hello,

On Sun, 22 Dec 2013 07:44:31 -0800 Kyle Bader wrote:

> > Is an object a CephFS file or an RBD image, or is it the 4MB blob on the
> > actual OSD FS?
> 
> Objects are at the RADOS level; CephFS filesystems, RBD images and RGW
> objects are all composed by striping across RADOS objects - the default
> object size is 4MB.
> 
Good, that clears that up and confirms how I figured it worked.
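
Writing the resulting arithmetic down for the archive (a trivial sketch,
assuming decimal units and the default 4MB object size):

    image_size  = 50 * 10**9           # 50GB RBD image, decimal units
    object_size = 4 * 10**6            # default 4MB RADOS objects
    print(image_size // object_size)   # -> 12500 RADOS objects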

> > In my case, I'm only looking at RBD images for KVM volume storage, even
> > given the default striping configuration I would assume that those
> > 12500 OSD objects for a 50GB image  would not be in the same PG and
> > thus just on 3 (with 3 replicas set) OSDs total?
> 
> Objects are striped across placement groups, so you take your RBD size
> / 4 MB and cap it at the total number of placement groups in your
> cluster.
> 

Yes, that also makes perfect sense. So the aforementioned 12500 objects
for a 50GB image end up in a pool with 2400 PGs, given a 60TB cluster
with 72 disks/OSDs and 3-way replication and following the recommended
formula.


> > What amount of disks (OSDs) did you punch in for the following run?
> >> Disk Modeling Parameters
> >>     size:           3TiB
> >>     FIT rate:        826 (MTBF = 138.1 years)
> >>     NRE rate:    1.0E-16
> >> RADOS parameters
> >>     auto mark-out:     10 minutes
> >>     recovery rate:    50MiB/s (40 seconds/drive)
> > Blink???
> > I guess that goes back to the number of disks, but to restore 2.25GB at
> > 50MB/s with 40 seconds per drive...
> 
> The surviving replicas for placement groups that the failed OSD
> participated in will naturally be distributed across many OSDs in the
> cluster; when the failed OSD is marked out, its replicas will be
> remapped to many OSDs. It's not a 1:1 replacement like you might find
> in a RAID array.
> 
I completely get that part; however, the total amount of data to be
rebalanced after a single disk/OSD failure to fully restore redundancy is
still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
assumed.
What I'm still missing in this picture is how many disks (OSDs) you
calculated this with. Maybe I'm just misreading the 40 seconds per drive
bit there. Because if that means each drive only needs to be active for
40 seconds to do its bit of recovery, we're talking about 1100 drives.
^o^ 1100 PGs would be another story.
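
If I read those parameters right (just my guess at how the model arrives
at 40 seconds, I haven't checked its code), it is the declustering factor
doing the work:

    failed_osd_data = 0.75 * 3 * 2**40        # 75% full 3TiB drive = 2.25TiB, in bytes
    declustering    = 1100                    # PG/OSD from the model parameters
    rate            = 50 * 2**20              # 50MiB/s per recovery stream
    per_pg = failed_osd_data / declustering   # ~2.1GiB handled per PG
    print(per_pg / rate)                      # -> ~43 seconds, i.e. the "40 seconds/drive"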

> >>     osd fullness:      75%
> >>     declustering:    1100 PG/OSD
> >>     NRE model:              fail
> >>     object size:      4MB
> >>     stripe length:   1100
> > I take it that is to mean that any RBD volume of sufficient size is
> > indeed spread over all disks?
> 
> Spread over all placement groups, the difference is subtle but there
> is a difference.
> 
Right, it isn't exactly a 1:1 match from what I saw/read.
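
A quick toy illustration of that subtlety (plain hashing and a random
PG-to-OSD map instead of CRUSH, so the numbers are only indicative):

    import hashlib, random

    pg_count, osd_count, replicas = 2400, 72, 3
    random.seed(1)
    # hypothetical PG -> OSD mapping; real Ceph uses CRUSH, not random.sample
    pg_to_osds = {pg: random.sample(range(osd_count), replicas)
                  for pg in range(pg_count)}

    objects = 12500                    # 50GB image / 4MB objects
    pgs_hit = {int(hashlib.md5(("obj_%d" % i).encode()).hexdigest(), 16) % pg_count
               for i in range(objects)}
    osds_hit = {osd for pg in pgs_hit for osd in pg_to_osds[pg]}
    print(len(pgs_hit), len(osds_hit)) # most of the 2400 PGs, and in practice all 72 OSDs

So at least in this toy picture, a large image is striped over (nearly)
every PG and through them ends up touching every OSD anyway; the PG is
just the unit that actually gets placed and recovered.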

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



