On 27/08/2014 04:34, Christian Balzer wrote:
> 
> Hello,
> 
> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> 
>> Hi Craig,
>> 
>> I assume the reason for the 48 hours recovery time is to keep the cost
>> of the cluster low ? I wrote "1h recovery time" because it is roughly
>> the time it would take to move 4TB over a 10Gb/s link. Could you
>> upgrade your hardware to reduce the recovery time to less than two
>> hours ? Or are there factors other than cost that prevent this ?
>> 
> 
> I doubt Craig is operating on a shoestring budget.
> And even if his network were to be just GbE, that would still make it
> only 10 hours according to your wishful thinking formula.
> 
> He probably has set the max_backfills to 1 because that is the level of
> I/O his OSDs can handle w/o degrading cluster performance too much.
> The network is unlikely to be the limiting factor.
> 
> The way I see it, most Ceph clusters are in a sort of steady state when
> operating normally, i.e. a few hundred VM RBD images ticking over, most
> actual OSD disk ops are writes, as nearly all hot objects that are being
> read are in the page cache of the storage nodes.
> Easy peasy.
> 
> Until something happens that breaks this routine, like a deep scrub, all
> those VMs rebooting at the same time, or a backfill caused by a failed
> OSD. Now all of a sudden client ops compete with the backfill ops, page
> caches are no longer hot, the spinners are seeking left and right.
> Pandemonium.
> 
> I doubt very much that even with an SSD-backed cluster you would get
> away with less than 2 hours for 4TB.
> 
> To give you some real life numbers, I am currently building a new
> cluster but for the time being have only one storage node to play with.
> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8
> actual OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on it.
> 
> So I took out one OSD (reweight 0 first, then the usual removal steps)
> because the actual disk was wonky. Replaced the disk and re-added the
> OSD. Both operations took about the same time: 4 minutes for evacuating
> the OSD (having 7 write targets clearly helped) for a measly 12GB, or
> about 50MB/s, and 5 minutes, or about 35MB/s, for refilling the OSD.
> And that is on one node (thus no network latency) with the default
> parameters (so a max_backfills of 10) which was otherwise totally idle.
> 
> In other words, in this pretty ideal case it would have taken 22 hours
> to re-distribute 4TB.

That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe the
risk of data loss does not increase, since the objects are evicted
gradually from the disk being decommissioned and the number of replicas
stays the same at all times. There is no sudden drop in the number of
replicas, which is what I had in mind.

If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
them) will start transferring a new replica of the objects they hold to
the new OSD in their PG. The replacement will not be a single OSD,
although nothing prevents the same OSD from being used in more than one
PG as a replacement for the lost one. If the cluster network is connected
at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
copies do not originate from a single OSD but from dozens of them, and
since they target more than one OSD, I assume we can expect an actual
throughput of about 5Gb/s.
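
A quick back-of-the-envelope sketch of that arithmetic (Python; the 4TB
payload, the 10Gb/s link and the 50% background utilisation are the
figures above, nothing else):

    # Rough time to re-create the replicas that were on a lost OSD.
    # Figures from the discussion above: 4TB to move over a 10Gb/s
    # cluster network, with an assumed fraction of the link already
    # busy with client traffic.

    def recovery_hours(data_tb, link_gbps, busy_fraction):
        data_bits = data_tb * 1e12 * 8                  # decimal TB -> bits
        usable_bps = link_gbps * 1e9 * (1.0 - busy_fraction)
        return data_bits / usable_bps / 3600.0          # seconds -> hours

    print(recovery_hours(4, 10, 0.0))   # idle 10Gb/s link : ~0.9h, the "1h" figure
    print(recovery_hours(4, 10, 0.5))   # 50% busy (5Gb/s) : ~1.8h, hence "2h"
    print(recovery_hours(4, 1, 0.0))    # idle GbE         : ~8.9h, the "10 hours"

Of course this assumes the spindles on the surviving OSDs can actually
keep the remaining 5Gb/s busy, which is exactly the point you raise above.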
I should have written 2h instead of 1h to account for the fact that the
cluster network is never idle.

Am I being too optimistic ? Do you see another blocking factor that would
significantly slow down recovery ?

Cheers

> More in another reply.
> 
>> Cheers
>> 
>> On 26/08/2014 19:37, Craig Lewis wrote:
>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
>>> max backfills = 1). I believe that increases my risk of failure by
>>> 48^2. Since your numbers are failure rate per hour per disk, I need
>>> to consider the risk for the whole time for each disk. So more
>>> formally, rebuild time to the power of (replicas - 1).
>>>
>>> So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much
>>> higher risk than 1 / 10^8.
>>>
>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>> human error than disk failure. Still, I can put a small bit of effort
>>> in to optimize recovery speed, and lower this number. Managing human
>>> error is much harder.
>>>
>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org
>>> <mailto:loic at dachary.org>> wrote:
>>>
>>>     Using percentages instead of numbers led me to calculation
>>>     errors. Here it is again using 1/100 instead of % for clarity ;-)
>>>
>>>     Assuming that:
>>>
>>>     * The pool is configured for three replicas (size = 3 which is
>>>       the default)
>>>     * It takes one hour for Ceph to recover from the loss of a single
>>>       OSD
>>>     * Any other disk has a 1/100,000 chance to fail within the hour
>>>       following the failure of the first disk (assuming the AFR
>>>       https://en.wikipedia.org/wiki/Annualized_failure_rate of every
>>>       disk is 8%, divided by the number of hours in a year:
>>>       0.08 / 8760 ~= 1/100,000)
>>>     * A given disk does not participate in more than 100 PGs
>>>
>> 

-- 
Loïc Dachary, Artisan Logiciel Libre
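
For reference, a minimal sketch of the back-of-the-envelope numbers quoted
above (the 8% AFR, size = 3, the 1/10^8 baseline for a 1h recovery window,
and Craig's 48 hour rebuild time are all taken from the quoted mails; this
only reproduces that arithmetic, it is not a full reliability model):

    # Reproduce the failure numbers quoted in the thread above.
    afr = 0.08                             # assumed annualized failure rate per disk
    hourly_failure = afr / 8760            # ~1/100,000 per disk per hour

    replicas = 3
    baseline_risk = 1e-8                   # the quoted "1 / 10^8" for a 1h recovery
    rebuild_hours = 48                     # Craig's observed rebuild time

    # Craig's scaling: rebuild time to the power of (replicas - 1).
    risk = baseline_risk * rebuild_hours ** (replicas - 1)

    print(1 / hourly_failure)              # ~109,500, i.e. roughly 1/100,000
    print(risk, 1 / risk)                  # 2.304e-05, i.e. roughly 1/43,000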