Re: Reliability model for RADOS - effects during second failures

Kyle Bader <kyle@xxxxxxxxxxx> · Wed, 2 Jul 2014 22:09:04 -0700

> The current code uses a “FIT rate multiplier” to include for instance
> the effect of operations done in parallel. That multiplier (n) has an
> effect on Pfail. In the initial failure, it is calculated using the
> number of replicas and the stripe count as seen in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L86.

I'm not sure what term we want to use for the thing whose durability
we are calculating, so for the sake of this explanation I'll use
"artifact", which will refer to a collection of objects that compose
one of the following:

1. RADOS object (stripe count=1)
2. RBD volume
3. RGW S3 or Swift object
4. RGW metadata pools
5. I'm probably forgetting something

My interpretation of the model's progression is:

1. Global population of placement groups, perhaps because we need the
entire pool intact, e.g. RGW metadata pools (upper bound for stripe
count).
2. Subset of placement groups across which we place portions of our
artifact, e.g. based on the size of RBD/RGW artifacts striped across
RADOS objects.
3. Multiplier, to account for the fact that a placement group becomes
degraded if any of its members is marked out due to failure.
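
To make the multiplier step concrete, here is a minimal sketch of how
a FIT rate multiplier can feed into Pfail. It is not the code from
RadosRely.py: the function names, the Poisson failure assumption and
the example FIT value of 5000 are all assumptions made for
illustration. The only fixed fact is that one FIT is one failure per
10^9 device-hours.

```python
import math

FIT_HOURS = 1e9  # one FIT = one failure per 10^9 device-hours


def p_fail(fit_rate, multiplier, hours):
    """Probability of at least one failure within `hours`, with the
    per-device FIT rate scaled by `multiplier` (Poisson assumption)."""
    expected_failures = fit_rate * multiplier * hours / FIT_HOURS
    # -expm1(-x) is 1 - exp(-x), but stays accurate for tiny x
    return -math.expm1(-expected_failures)


def initial_multiplier(copies, stripe_count):
    """Hypothetical initial-failure multiplier: every copy of every
    placement group the artifact touches is exposed."""
    return copies * stripe_count


year = 24 * 365
for stripes in (1, 4, 16):
    n = initial_multiplier(copies=3, stripe_count=stripes)
    # fit_rate=5000 is an illustrative value, not a measured one
    print(stripes, n, p_fail(fit_rate=5000, multiplier=n, hours=year))
```

A wider stripe gives a larger multiplier and therefore a higher
probability of an initial failure over the same period.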

> The thing that doesn’t have sense to me is the way the multiplier is
> calculated for the failure of the remaining copies in
> https://github.com/ceph/ceph-tools/blob/master/models/reliability/RadosRely.py#L92
> Why the stripes are not taking into account?

Stripes are not taken into account because at this point in the model
we are calculating the chances of the degraded placement group
becoming further degraded by suffering the loss of another member.
Failures of other placement groups in the same stripe during the
recovery of our placement group should be calculated as independent
events.
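
One way to read "independent event" is sketched below: work out the
loss probability for a single placement group on its own, then
combine the placement groups of the stripe as independent trials.
This is an illustration under assumptions, not the model's actual
equation: the helper names are hypothetical, the per-PG loss is
simplified to the two-copy case (one first failure followed by one
further failure during recovery), and the example probabilities are
made up.

```python
def p_pg_loss(p_first, p_second_during_recovery):
    """A two-copy PG is lost only if a first failure is followed by a
    failure of the remaining member before recovery completes."""
    return p_first * p_second_during_recovery


def p_any(probabilities):
    """P(at least one of several independent events occurs)."""
    survive = 1.0
    for p in probabilities:
        survive *= 1.0 - p
    return 1.0 - survive


# Illustrative numbers only: 4 PGs in the stripe, each treated alone.
per_pg = p_pg_loss(p_first=1e-2, p_second_during_recovery=1e-4)
print(per_pg, p_any([per_pg] * 4))
```

The stripe count only shows up in the final combination step; the
per-PG second-failure term does not need to know about it.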

> What is the purpose of
> using the “declustering factor” on that equation?

My understanding is that the declustering factor is synonymous with
placement groups (PGs).

> Is that equation
> correct? I read this note by sage
> https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg01650.html
> trying to clarify the role of PGs but didn’t help me to understand it.

To distribute objects across the cluster we need to divvy up objects
into groupings; in the context of Ceph those groupings are PGs
(placement groups). There is a cost associated with maintaining each
placement group, and the benefit is that finer distribution
granularity can improve utilization at the high end. This should be
reflected in the full/nearfull tunables we set for our cluster:

http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
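
The granularity point can be illustrated with a toy simulation. It
assumes PGs land on OSDs uniformly at random, which is only a
stand-in for real CRUSH placement, and the function name and the
cluster sizes are hypothetical. It reports how much fuller the
busiest OSD is than the average.

```python
import random
from collections import Counter


def imbalance(num_osds, num_pgs, seed=0):
    """Ratio of the busiest OSD's PG count to the mean PG count.
    1.0 would be a perfectly even distribution."""
    rng = random.Random(seed)
    # Assumption: uniform random placement instead of CRUSH
    load = Counter(rng.randrange(num_osds) for _ in range(num_pgs))
    return max(load.values()) / (num_pgs / num_osds)


for pgs in (25, 250, 10000):
    print(pgs, imbalance(num_osds=10, num_pgs=pgs))
```

With few PGs per OSD the busiest OSD sits well above the mean, so it
reaches the nearfull threshold long before the cluster as a whole is
full; with many PGs the ratio moves toward 1.0.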

> Besides, I have a simple question related with the equation on L86 for
> the initial failure. The stripping process splits user content in
> #number of objects, which equivalent to the stripe count. That group
> of objects constitutes an object set. Each object is composed by one
> or more stripes units. All stripes units (stripe count) are written in
> parallel. Typically each object is mapped to a different disk.  What
> happen when the object set is full and a new object is started?

A second (or subsequent) object is placed in one of the placement
groups that already holds another object belonging to the same
artifact. This way you can have arbitrarily sized artifacts and still
limit the number of placement groups involved, which reduces the
probability of failure.
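
A small sketch of that wrap-around, as the model treats it. The
round-robin assignment and the function names are assumptions for
illustration; a real cluster hashes each object name to a PG rather
than counting slots.

```python
from collections import Counter


def objects_per_pg(num_objects, stripe_count):
    """Round-robin objects over at most `stripe_count` PG slots
    (hypothetical simplification of placement)."""
    return Counter(i % stripe_count for i in range(num_objects))


def pgs_touched(num_objects, stripe_count):
    """Number of distinct PG slots the artifact ends up using."""
    return len(objects_per_pg(num_objects, stripe_count))


# Ten objects with stripe count 4 reuse the same four PG slots.
print(objects_per_pg(10, 4))
print(pgs_touched(10, 4))
```

However many objects the artifact grows to, the number of placement
groups it depends on is capped at the stripe count.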

-- 
Kyle Bader - Inktank
Senior Solution Architect