Best practice K/M-parameters EC pool

My OSD rebuild time is more like 48 hours (4 TB disks, >60% full, osd max
backfills = 1). I believe that increases my risk of failure by a factor of
48^2. Since your numbers are a failure rate per hour per disk, I need to
consider the risk over the whole rebuild window for each additional disk;
more formally, the risk scales with the rebuild time to the power of
(replicas - 1).

So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much
higher risk than the original 1/10^8 estimate.
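
To make the arithmetic explicit, here is a small back-of-the-envelope
sketch in Python. The 1/10^8 baseline and the 1/100,000 per-disk per-hour
figure come from the quoted calculation below; the 48-hour rebuild is my
own number, and the whole thing is illustrative, not measured:

    # Scaling the quoted 1-hour data-loss estimate to a 48-hour rebuild.
    # Assumptions: size = 3, baseline risk ~1/10^8 for a 1-hour recovery,
    # and each additional failure has `rebuild_hours` times as long to
    # happen, so the risk scales with rebuild_hours ** (replicas - 1).

    baseline_risk_1h = 1 / 10**8   # quoted figure for a 1-hour recovery
    replicas = 3
    rebuild_hours = 48             # 4 TB disks, >60% full, osd max backfills = 1

    scaling = rebuild_hours ** (replicas - 1)   # 48**2 == 2304
    risk = baseline_risk_1h * scaling           # 2304 / 10**8

    print("scaling factor:", scaling)           # 2304
    print("risk: ~1 in", round(1 / risk))       # prints 43403, i.e. ~1/43,000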


A risk of 1/43,000 means I'm more likely to lose data to human error than
to disk failure. Still, with a small amount of effort I can optimize
recovery speed and lower this number; managing human error is much harder.






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org> wrote:

> Using percentages instead of numbers led me to calculation errors. Here
> it is again using 1/100 instead of % for clarity ;-)
>
> Assuming that:
>
> * The pool is configured for three replicas (size = 3, which is the default)
> * It takes one hour for Ceph to recover from the loss of a single OSD
> * Any other disk has a 1/100,000 chance of failing within the hour following
> the failure of the first disk (assuming the AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%; divided by the number of hours in a year, that is (0.08 / 8760) ~=
> 1/100,000)
> * A given disk does not participate in more than 100 PGs
>
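
(Double-checking the per-hour figure quoted above, assuming the same 8% AFR
and 8760 hours in a year; a quick illustrative sketch, not part of Loic's
mail:)

    # Per-disk, per-hour failure probability derived from the AFR,
    # as in the quoted assumptions (8% AFR, 24 * 365 = 8760 hours/year).
    afr = 0.08
    hours_per_year = 24 * 365

    p_per_hour = afr / hours_per_year
    print(p_per_hour)   # ~9.1e-06, i.e. roughly 1/100,000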