> The current code uses a “FIT rate multiplier” to include, for instance,
> the effect of operations done in parallel. That multiplier (n) has an
> effect on Pfail. In the initial failure, it is calculated using the
> number of replicas and the stripe count as seen in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.

So I'm not sure what term we want to use for the thing whose durability
we are calculating, but for the sake of this explanation I'll use
"artifact", which will refer to a collection of objects that compose a:

1. RADOS object (stripe count=1)
2. RBD volume
3. RGW S3 or Swift object
4. RGW metadata pools
5. I'm probably forgetting something

My interpretation of the model's progression is:

1. Global population of placement groups, perhaps because we need the
   entire pool intact, e.g. RGW metadata pools (upper bound for stripe
   count).
2. Subset of placement groups in which we will place portions of our
   artifact, e.g. based on the size of RBD/RGW artifacts striped across
   RADOS objects.
3. Multiplier, to account for the fact that the placement group will
   become degraded if any of its members are marked out due to failure.

> The thing that doesn’t make sense to me is the way the multiplier is
> calculated for the failure of the remaining copies in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
> Why are the stripes not taken into account?

Stripes are not taken into account because at this point in the model
we are calculating the chances of the degraded placement group becoming
further degraded by suffering the loss of another member. Failures of
other placement groups in the same stripe during the recovery of our
placement group should be calculated as an independent event.

> What is the purpose of
> using the “declustering factor” in that equation?

My understanding is that the declustering factor is synonymous with
placement groups (PGs).

> Is that equation correct?
> I read this note by Sage
> https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg01650.html
> trying to clarify the role of PGs, but it didn’t help me to understand it.

To distribute objects across the cluster we need to divvy up objects
into groupings; in the context of Ceph those groupings are PGs
(placement groups). There is a cost associated with maintaining each
placement group, and the benefit is that finer distribution granularity
can improve utilization at the high end. This should be reflected in
the full/nearfull tunables we set for our cluster:

http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity

> Besides, I have a simple question related to the equation on L86 for
> the initial failure. The striping process splits user content into a
> number of objects equal to the stripe count. That group
> of objects constitutes an object set. Each object is composed of one
> or more stripe units. All stripe units (stripe count) are written in
> parallel. Typically each object is mapped to a different disk. What
> happens when the object set is full and a new object is started?

It places a second (or more) object in one of the placement groups that
already has another object belonging to the same artifact. In this way
you can have arbitrarily sized artifacts and still limit the number of
placement groups in order to reduce the probability of failure.

--
Kyle Bader - Inktank Senior Solution Architect
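
To make the striping question above a bit more concrete, here is a
minimal sketch of how a byte offset in an artifact could be mapped to an
object set, an object number and an offset within that object. It
assumes the usual stripe unit / stripe count / object size layout
described in the question. The names locate() and Extent are
hypothetical and are not taken from RadosRely.py or from Ceph itself,
and the sketch does not model how objects map to placement groups.

```python
from collections import namedtuple

# Hypothetical result type: where a given byte of the artifact lands.
Extent = namedtuple("Extent", "object_set object_no object_offset")


def locate(offset, stripe_unit, stripe_count, object_size):
    """Map a byte offset in a striped artifact to (set, object, offset).

    Assumes object_size is a whole multiple of stripe_unit.
    """
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")

    units_per_object = object_size // stripe_unit

    unit_no = offset // stripe_unit          # which stripe unit overall
    stripe_no = unit_no // stripe_count      # which stripe (row of units)
    stripe_pos = unit_no % stripe_count      # which object within the set

    # An object set is full once each of its objects holds
    # units_per_object stripe units; the next stripe starts a new set.
    object_set = stripe_no // units_per_object
    object_no = object_set * stripe_count + stripe_pos
    object_offset = ((stripe_no % units_per_object) * stripe_unit
                     + offset % stripe_unit)
    return Extent(object_set, object_no, object_offset)


if __name__ == "__main__":
    # Toy numbers: 4-byte stripe units, 3 objects per set, 8-byte objects,
    # so one object set holds 24 bytes.
    for off in (0, 4, 12, 23, 24):
        print(off, locate(off, stripe_unit=4, stripe_count=3, object_size=8))
```

With these toy numbers, offsets 0 through 23 stay in object set 0
(objects 0 to 2), and offset 24 is the first byte of object 3 in object
set 1. So an artifact of any size simply keeps adding object sets of
stripe_count objects each.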