Best practice K/M-parameters EC pool

Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

> Hi Craig,
> 
> I assume the reason for the 48 hours recovery time is to keep the cost
> of the cluster low ? I wrote "1h recovery time" because it is roughly
> the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
> your hardware to reduce the recovery time to less than two hours ? Or
> are there factors other than cost that prevent this ?
> 

I doubt Craig is operating on a shoestring budget.
And even if his network were just GbE, that would still come to only about
10 hours by your wishful-thinking formula.
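
For reference, the back-of-the-envelope arithmetic behind those figures, as
a quick Python sketch that only considers the wire and ignores protocol
overhead, disk throughput and competing client I/O:

    # Naive "data size over link speed" recovery estimate; the
    # wishful-thinking part is assuming the network is the only bottleneck.
    def transfer_hours(data_tb, link_gbit_s):
        bits = data_tb * 10**12 * 8                # TB -> bits
        return bits / (link_gbit_s * 10**9) / 3600

    print(transfer_hours(4, 10))   # ~0.9h, the "1h recovery" figure
    print(transfer_hours(4, 1))    # ~8.9h, i.e. roughly 10h on plain GbE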

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it, most Ceph clusters are in a sort of steady state when
operating normally: a few hundred VM RBD images ticking over, most actual
OSD disk ops being writes, since nearly all hot objects that are being read
sit in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub, all
those VMs rebooting at the same time or a backfill caused by a failed OSD.
Now all of a sudden client ops compete with the backfill ops, page caches
are no longer hot, the spinners are seeking left and right. 
Pandemonium.

I doubt very much that even with an SSD-backed cluster you would get away
with less than 2 hours for 4TB.

To give you some real-life numbers: I am currently building a new cluster,
but for the time being have only one storage node to play with.
It has 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8 actual
OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on it.

So I took out one OSD (reweight to 0 first, then the usual removal steps)
because the actual disk was wonky, then replaced the disk and re-added the
OSD. Both operations took about the same time: 4 minutes to evacuate the
OSD (having 7 write targets clearly helped), a measly 12GB at about 50MB/s,
and 5 minutes, or about 35MB/s, to refill it.
And that was on a single node (thus no network latency) with the default
parameters (so max_backfills of 10) that was otherwise totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
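
Spelled out, that extrapolation looks like this (a sketch only, and it
assumes recovery stays throughput-bound at the observed single-node rates,
which is optimistic once client I/O competes for the spindles):

    # Scaling the observed 12GB moves up to a 4TB OSD.
    evacuate_rate_mb_s = 50          # 12GB out in ~4 minutes
    refill_rate_mb_s = 35            # 12GB back in ~5 minutes
    data_mb = 4 * 1000 * 1000        # 4TB

    print(data_mb / evacuate_rate_mb_s / 3600)   # ~22 hours to evacuate
    print(data_mb / refill_rate_mb_s / 3600)     # ~32 hours to refill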

More in another reply.

> Cheers
> 
> On 26/08/2014 19:37, Craig Lewis wrote:
> > My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
> > max backfills = 1).   I believe that increases my risk of failure by
> > 48^2 .  Since your numbers are failure rate per hour per disk, I need
> > to consider the risk for the whole time for each disk.  So more
> > formally, rebuild time to the power of (replicas -1).
> > 
> > So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
> > higher risk than 1 / 10^8.
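
(As an aside, here is Craig's arithmetic spelled out, assuming I read it
correctly: a 48h rebuild window makes each of the two additional failures
48 times more likely than in the 1h model, hence the square.)

    # Craig's scaling of Loic's 1-hour figure, as I read it.
    base_risk = 1 / 10**8        # Loic's estimate for a 1h recovery window
    rebuild_hours = 48
    replicas = 3

    risk = base_risk * rebuild_hours**(replicas - 1)   # 2304 / 10^8
    print(1 / risk)              # ~43,000, i.e. about 1 in 43,000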
> > 
> > 
> > A risk of 1/43,000 means that I'm more likely to lose data due to
> > human error than disk failure.  Still, I can put a small bit of effort
> > in to optimize recovery speed, and lower this number.  Managing human
> > error is much harder.
> > 
> > 
> > 
> > 
> > 
> > 
> > On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org> wrote:
> > 
> >     Using percentages instead of numbers led me to calculation
> > errors. Here it is again using 1/100 instead of % for clarity ;-)
> > 
> >     Assuming that:
> > 
> >     * The pool is configured for three replicas (size = 3 which is the
> > default)
> >     * It takes one hour for Ceph to recover from the loss of a single
> > OSD
> >     * Any other disk has a 1/100,000 chance to fail within the hour
> > following the failure of the first disk (assuming AFR
> > https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
> > 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
> > 1/100,000)
> >     * A given disk does not participate in more than 100 PG
> > 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

