Best practice K/M-parameters EC pool

I am using GigE.  I'm building a cluster using existing hardware, and the
network hasn't been my bottleneck (yet).

I've benchmarked the single-disk recovery speed at about 50 MB/s, using osd max
backfills = 4, with SSD journals.  If I go higher, disk bandwidth increases
slightly, but latency starts climbing.  At max backfills = 10, I regularly see
OSD latency hit the 1-second mark.  With max backfills = 4, OSD latency is
pretty much the same as with max backfills = 1.  I haven't tested 5-9 yet.
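
If anyone wants to experiment with the same knob, here's a minimal sketch of
changing it at runtime (assuming Python and 'ceph tell osd.* injectargs'; the
exact option syntax can vary between releases, so treat it as an illustration):

    import subprocess

    def set_max_backfills(n):
        # Change osd_max_backfills on every OSD without a restart.
        # (injectargs option syntax may differ slightly between Ceph releases.)
        subprocess.check_call(
            ["ceph", "tell", "osd.*", "injectargs",
             "--osd-max-backfills=%d" % n])

    set_max_backfills(4)   # 4 was the sweet spot in the tests above
    # set_max_backfills(1) # drop back down before walking away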

I'm tracking latency by polling the OSD perf numbers every minute, recording
the delta from the previous poll, and calculating the average latency over the
last minute.  Given that it's an average over the last minute, a 1-second
average latency is way too high.  That usually means one operation took > 30
seconds, and the other operations were mostly OK.  It's common to see blocked
operations in ceph -w when latency is this high.
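
In case it's useful, here's a rough sketch of that polling loop (assuming
Python, the OSD admin socket via 'ceph daemon osd.N perf dump', and the
op_latency counter; counter names and layout can differ between Ceph releases):

    import json, subprocess, time

    OSD_ID = 0        # hypothetical OSD to watch; run on the host that owns it
    INTERVAL = 60     # poll once a minute, as described above

    def op_latency(osd_id):
        # 'perf dump' returns JSON; op_latency keeps a running sum of latency
        # (in seconds) and a count of ops since the OSD started.
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
        counter = json.loads(out.decode())["osd"]["op_latency"]
        return counter["avgcount"], counter["sum"]

    prev_count, prev_sum = op_latency(OSD_ID)
    while True:
        time.sleep(INTERVAL)
        count, total = op_latency(OSD_ID)
        ops, delta = count - prev_count, total - prev_sum
        prev_count, prev_sum = count, total
        avg = float(delta) / ops if ops else 0.0
        print("osd.%d: %.3f s average op latency over the last minute"
              % (OSD_ID, avg))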


At 50 MB/s for a single disk, it takes at least 14 hours to rebuild one of
my disks (4TB disk, 60% full).  If I'm not sitting in front of the computer,
I usually run only 2 backfills.  I'm very paranoid, due to some problems I
had early in our production deployment.  Most of those problems were caused
by 64k XFS inodes, not Ceph.  But I have things working now, so I'm hesitant
to change anything.  :-)
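
The arithmetic, for anyone who wants to plug in their own numbers:

    # Rebuild time = data to move / single-disk recovery rate.
    disk_tb  = 4.0      # raw disk size, TB
    fill     = 0.60     # fraction of the disk holding data
    rate_mbs = 50.0     # observed recovery rate, MB/s

    data_mb = disk_tb * 1e6 * fill
    hours   = data_mb / rate_mbs / 3600
    print("%.1f hours" % hours)   # ~13.3 h of data movement, hence "at least 14 hours"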




On Tue, Aug 26, 2014 at 11:21 AM, Loic Dachary <loic at dachary.org> wrote:

> Hi Craig,
>
> I assume the reason for the 48-hour recovery time is to keep the cost of
> the cluster low? I wrote "1h recovery time" because it is roughly the time
> it would take to move 4TB over a 10Gb/s link. Could you upgrade your
> hardware to reduce the recovery time to less than two hours? Or are there
> factors other than cost that prevent this?
>
> Cheers
>
> On 26/08/2014 19:37, Craig Lewis wrote:
> > My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd max
> backfills = 1).   I believe that increases my risk of failure by 48^2.
> Since your numbers are failure rate per hour per disk, I need to consider
> the risk for the whole time for each disk.  So more formally, rebuild time
> to the power of (replicas -1).
> >
> > So I'm at 2304/100,000,000, or approximately 1/43,000.  That's a much
> higher risk than 1 / 10^8.
> >
> >
> > A risk of 1/43,000 means that I'm more likely to lose data due to human
> error than disk failure.  Still, I can put a small bit of effort in to
> optimize recovery speed, and lower this number.  Managing human error is
> much harder.
> >
> >
> >
> >
> >
> >
> > On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org <mailto:
> loic at dachary.org>> wrote:
> >
> >     Using percentages instead of numbers led me to calculation errors.
> Here it is again using 1/100 instead of % for clarity ;-)
> >
> >     Assuming that:
> >
> >     * The pool is configured for three replicas (size = 3 which is the
> default)
> >     * It takes one hour for Ceph to recover from the loss of a single OSD
> >     * Any other disk has a 1/100,000 chance of failing within the hour
> following the failure of the first disk (assuming the AFR
> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> 8%, divided by the number of hours in a year: 0.08 / 8760 ~= 1/100,000)
> >     * A given disk does not participate in more than 100 PGs
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
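
For completeness, a back-of-the-envelope version of the risk numbers quoted
above (assuming Loic's 1/10^8 figure for a 1-hour recovery window, my 48-hour
rebuild, and 3 replicas):

    base_risk = 1e-8    # quoted risk of data loss with a 1-hour recovery window
    rebuild_h = 48      # observed rebuild time with osd max backfills = 1
    replicas  = 3

    # A longer rebuild scales the risk by rebuild time to the power of
    # (replicas - 1), per the reasoning quoted above.
    risk = base_risk * rebuild_h ** (replicas - 1)
    print("1 in %.0f" % (1 / risk))   # ~1 in 43,000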

