Hello,

On Sun, 22 Dec 2013 07:44:31 -0800 Kyle Bader wrote:

> > Is an object a CephFS file or a RBD image or is it the 4MB blob on the
> > actual OSD FS?
>
> Objects are at the RADOS level; CephFS filesystems, RBD images and RGW
> objects are all composed by striping RADOS objects - the default is 4MB.
>
Good, that clears that up and confirms how I figured it worked.

> > In my case, I'm only looking at RBD images for KVM volume storage. Even
> > given the default striping configuration I would assume that those
> > 12500 OSD objects for a 50GB image would not be in the same PG and
> > thus just on 3 (with 3 replicas set) OSDs total?
>
> Objects are striped across placement groups, so you take your RBD size
> / 4MB and cap it at the total number of placement groups in your
> cluster.
>
Yes, that also makes perfect sense. So for the aforementioned 12500
objects of a 50GB image, a 60TB cluster/pool with 72 disks/OSDs and
3-way replication makes 2400 PGs, following the recommended formula.

> > What amount of disks (OSDs) did you punch in for the following run?
> >> Disk Modeling Parameters
> >>     size:     3TiB
> >>     FIT rate: 826 (MTBF = 138.1 years)
> >>     NRE rate: 1.0E-16
> >> RADOS parameters
> >>     auto mark-out:  10 minutes
> >>     recovery rate:  50MiB/s (40 seconds/drive)
> > Blink???
> > I guess that goes back to the number of disks, but to restore 2.25GB at
> > 50MB/s with 40 seconds per drive...
>
> The surviving replicas for placement groups that the failed OSD
> participated in will naturally be distributed across many OSDs in the
> cluster; when the failed OSD is marked out, its replicas will be
> remapped to many OSDs. It's not a 1:1 replacement like you might find
> in a RAID array.
>
I completely get that part; however, the total amount of data to be
rebalanced after a single disk/OSD failure to fully restore redundancy
is still 2.25TB (mistyped that as GB earlier) at the 75% utilization you
assumed.
What I'm still missing in this picture is how many disks (OSDs) you
calculated this with.
Maybe I'm just misreading the "40 seconds per drive" bit there, because
if it means each drive only needs to be active for 40 seconds to do its
bit of the recovery, we're talking about 1100 drives. ^o^
1100 PGs would be another story.

> >> osd fullness:  75%
> >> declustering:  1100 PG/OSD
> >> NRE model:     fail
> >> object size:   4MB
> >> stripe length: 1100
> > I take it that is to mean that any RBD volume of sufficient size is
> > indeed spread over all disks?
>
> Spread over all placement groups; the difference is subtle but there
> is a difference.
>
Right, it isn't exactly a 1:1 match from what I saw/read.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
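
P.S.: For anyone following along, here is a quick back-of-the-envelope
sketch (Python, purely illustrative) of the arithmetic in this thread,
using only the numbers quoted above: 4MB objects, 72 OSDs, 3-way
replication, 3TiB drives at 75% utilization and a 50MiB/s recovery rate.
The ~1180 drives it arrives at is where my rough "1100 drives" figure
comes from.

# Back-of-the-envelope Ceph numbers from this thread (illustrative only).
MiB = 1024 ** 2
TiB = 1024 ** 4

object_size   = 4 * MiB          # default RADOS object/stripe size
image_size    = 50 * 1000 * MiB  # the "50GB" RBD image discussed above
osds          = 72
replicas      = 3
drive_size    = 3 * TiB
utilization   = 0.75
recovery_rate = 50 * MiB         # per second, per the model

# RBD image -> RADOS objects
objects = image_size // object_size
print("RADOS objects per image:", objects)                 # 12500

# Recommended PG count: (OSDs * 100) / replicas
pgs = osds * 100 // replicas
print("recommended PGs:", pgs)                             # 2400

# Data to re-replicate after losing one OSD
to_recover = drive_size * utilization
print("data to recover: %.2f TiB" % (to_recover / TiB))    # 2.25

# If every participating drive is only active for 40 seconds at 50MiB/s,
# how many drives would the recovery have to be spread across?
per_drive = recovery_rate * 40
print("drives needed:", round(to_recover / per_drive))     # ~1180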