Best practice K/M-parameters EC pool

On 28/08/2014 16:29, Mike Dawson wrote:
> On 8/28/2014 12:23 AM, Christian Balzer wrote:
>> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
>>
>>>
>>>
>>> On 27/08/2014 04:34, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
>>>>
>>>>> Hi Craig,
>>>>>
>>>>> I assume the reason for the 48-hour recovery time is to keep the cost
>>>>> of the cluster low? I wrote "1h recovery time" because it is roughly
>>>>> the time it would take to move 4TB over a 10Gb/s link. Could you
>>>>> upgrade your hardware to reduce the recovery time to less than two
>>>>> hours? Or are there factors other than cost that prevent this?
>>>>>
>>>>
>>>> I doubt Craig is operating on a shoestring budget.
>>>> And even if his network were to be just GbE, that would still make it
>>>> only 10 hours according to your wishful thinking formula.
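>>>>
>>>> (The raw wire-time arithmetic behind both figures, assuming the link
>>>> is the only bottleneck, which it rarely is:)
>>>> ---
>>>> $ echo "4 * 10^12 / (1.25 * 10^9) / 3600" | bc -l   # 4TB at 10Gb/s, in hours
>>>> .88888888888888888888
>>>> $ echo "4 * 10^12 / (125 * 10^6) / 3600" | bc -l    # 4TB at 1Gb/s, in hours
>>>> 8.88888888888888888888
>>>> ---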
>>>>
>>>> He probably has set the max_backfills to 1 because that is the level of
>>>> I/O his OSDs can handle w/o degrading cluster performance too much.
>>>> The network is unlikely to be the limiting factor.
>>>>
>>>> The way I see it, most Ceph clusters are in a sort of steady state
>>>> when operating normally, i.e. a few hundred VM RBD images ticking
>>>> over; most actual OSD disk ops are writes, as nearly all hot objects
>>>> that are being read are in the page cache of the storage nodes.
>>>> Easy peasy.
>>>>
>>>> Until something happens that breaks this routine, like a deep scrub,
>>>> all those VMs rebooting at the same time or a backfill caused by a
>>>> failed OSD. Now all of a sudden client ops compete with the backfill
>>>> ops, page caches are no longer hot, the spinners are seeking left and
>>>> right. Pandemonium.
>>>>
>>>> I doubt very much that even with a SSD backed cluster you would get
>>>> away with less than 2 hours for 4TB.
>>>>
>>>> To give you some real life numbers, I currently am building a new
>>>> cluster but for the time being have only one storage node to play with.
>>>> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8
>>>> actual OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on it.
>>>>
>>>> So I took out one OSD (reweight 0 first, then the usual removal steps)
>>>> because the actual disk was wonky. Replaced the disk and re-added the
>>>> OSD. Both operations took about the same time: 4 minutes for
>>>> evacuating the OSD (having 7 write targets clearly helped), a measly
>>>> 12GB or about 50MB/s, and 5 minutes or about 35MB/s for refilling the
>>>> OSD. And that is on one node (thus no network latency) that has the
>>>> default parameters (so a max_backfill of 10) and was otherwise
>>>> totally idle.
>>>>
>>>> In other words, in this pretty ideal case it would have taken 22 hours
>>>> to re-distribute 4TB.
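>>>>
>>>> (That is just the measured evacuation rate scaled up, assuming the
>>>> rate would hold for a full disk:)
>>>> ---
>>>> $ echo "4 * 10^12 / (50 * 10^6) / 3600" | bc -l   # 4TB at 50MB/s, in hours
>>>> 22.22222222222222222222
>>>> ---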
>>>
>>> That makes sense to me :-)
>>>
>>> When I wrote 1h, I thought about what happens when an OSD becomes
>>> unavailable with no planning in advance. In the scenario you describe
>>> the risk of data loss does not increase, since the objects are evicted
>>> gradually from the disk being decommissioned and the number of replicas
>>> stays the same at all times. There is no sudden drop in the number of
>>> replicas, which is what I had in mind.
>>>
>> That may be, but I'm rather certain that there is no difference in speed
>> and priority of a rebalancing caused by an OSD set to weight 0 or one
>> being set out.
>>
>>> If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
>>> them) will start transferring a new replica of the objects they have to
>>> the new OSD in their PG. The replacement will not be a single OSD,
>>> although nothing prevents the same OSD from being used in more than one
>>> PG as a replacement for the lost one. If the cluster network is
>>> connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s.
>>> Since the new replicas do not originate from a single OSD but from at
>>> least dozens of them, and since they target more than one OSD, I assume
>>> we can expect an actual throughput of 5Gb/s. I should have written 2h
>>> instead of 1h to account for the fact that the cluster network is never
>>> idle.
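>>>
>>> (For the record, 4TB at an effective 5Gb/s, i.e. 0.625GB/s:)
>>> ---
>>> $ echo "4 * 10^12 / (0.625 * 10^9) / 3600" | bc -l   # hours
>>> 1.77777777777777777777
>>> ---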
>>>
>>> Am I being too optimistic?
>> Vastly.
>>
>>> Do you see another blocking factor that
>>> would significantly slow down recovery?
>>>
>> As Craig and I keep telling you, the network is not the limiting factor.
>> Concurrent disk IO is, as I pointed out in the other thread.
> 
> Completely agree.
> 
> On a production cluster with OSDs backed by spindles, even with OSD journals on SSDs, it is insufficient to calculate single-disk replacement backfill time based solely on network throughput. IOPS will likely be the limiting factor when backfilling a single failed spinner in a production cluster.
> 
> Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs.
> 
> Using only the throughput math, backfill could theoretically have completed in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times with similar results.
> 
> Why? Spindle contention on the replacement drive. Graph the '%util' metric from something like 'iostat -xt 2' during a single disk backfill to get a very clear view that spindle contention is the true limiting factor. It'll be pegged at or near 100% if spindle contention is the issue.
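>
> (Both halves of that claim are easy to check; the device name below is
> just an example:)
> ---
> $ echo "2.25 * 10^12 / (250 * 10^6) / 3600" | bc -l   # ~2.25TB at bonded 2Gb/s, in hours
> 2.50000000000000000000
> $ iostat -xt 2 sdb   # watch %util on the replacement spindle during backfill
> ---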

Hi Mike,

Did you by any chance also measure how long it took for the 3 replicas to be restored on all PGs in which the failed disk was participating? I assume the following sequence happened:

A) The 3TB drive failed and contained ~2TB
B) The cluster recovered by creating new replicas
C) The new 3TB drive was installed
D) Backfilling completed

I'm interested in the time between A and B, i.e. when one copy is potentially lost forever, because this is when the probability of permanent data loss increases. Although it is important to reduce the time between C and D to a minimum, it has no impact on the durability of the data.
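
For what it's worth, using the 1/100,000 per disk-hour odds from earlier in this thread, and ignoring how the copies are spread across PGs, the chance of losing the two remaining replicas grows with the square of that window:

---
$ echo "(1/100000)^2" | bc -l    # two more failures within a 1 hour window
.00000000010000000000
$ echo "(48/100000)^2" | bc -l   # the same within a 48 hour window
.00000023040000000000
---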

Cheers

> - Mike
> 
> 
>>
>> Another example if you please:
>> My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs.
>> 1 GbE links for client and cluster respectively.
>> ---
>> #ceph -s
>>      cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
>>       health HEALTH_OK
>>       monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
>>       osdmap e1206: 4 osds: 4 up, 4 in
>>        pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
>>              141 GB used, 2323 GB / 2464 GB avail
>>                   256 active+clean
>> ---
>> Replication size is 2; it can do about 60MB/s of writes with rados bench
>> from a client.
>>
>> Setting one OSD out (the data distribution is nearly uniform) it took 12
>> minutes to recover on a completely idle (no clients connected) cluster.
>> The disk utilization was 70-90%, the cluster network hovered around 20%,
>> never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
>> Given the ceph log numbers and the data size, I make this a recovery speed
>> of about 40MB/s or 13MB/s per OSD.
>> Better than I expected, but a far cry from what the OSDs could do
>> individually if they were not flooded with concurrent read and write
>> requests by the backfilling operation.
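>>
>> (Rough cross-check, assuming a uniform distribution: with 62140MB of
>> data at size 2, about a quarter of the raw bytes, ~31GB, lived on the
>> out OSD and had to be re-created in those 12 minutes:)
>> ---
>> $ echo "62140 * 2 / 4 / 720" | bc -l   # MB/s, same ballpark as the logs
>> 43.15277777777777777777
>> ---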
>>
>> Now, more disks will help, but I very much doubt that this will scale
>> linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong,
>> please).
>>
>> And this was an IDLE cluster.
>>
>> Doing this on a cluster with just about 10 client IOPS per OSD would be
>> far worse. Never mind that people don't like their client IO to stall for
>> more than a few seconds.
>>
>> Something that might improve this both in terms of speed and impact on
>> the clients would be something akin to the MD (Linux software RAID)
>> recovery logic.
>> As in, only one backfill operation per OSD (read or write, not both!) at
>> the same time.
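>>
>> (The existing knobs only cap the number of concurrent operations per
>> OSD; they do not serialize reads and writes the way MD does:)
>> ---
>> $ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
>> ---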
>>
>> Regards,
>>
>> Christian
>>> Cheers
>>>
>>>> More in another reply.
>>>>
>>>>> Cheers
>>>>>
>>>>> On 26/08/2014 19:37, Craig Lewis wrote:
>>>>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
>>>>>> max backfills = 1). I believe that increases my risk of failure by
>>>>>> 48^2. Since your numbers are failure rate per hour per disk, I need
>>>>>> to consider the risk over the whole rebuild time for each disk. More
>>>>>> formally: rebuild time to the power of (replicas - 1).
>>>>>>
>>>>>> So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much
>>>>>> higher risk than 1 / 10^8.
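>>>>>>
>>>>>> (i.e. 48^2 = 2304 times the 1/10^8 baseline:)
>>>>>> ---
>>>>>> $ echo "10^8 / 48^2" | bc -l   # one in ...
>>>>>> 43402.77777777777777777777
>>>>>> ---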
>>>>>>
>>>>>>
>>>>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>>>>> human error than disk failure. Still, I can put in a small bit of
>>>>>> effort to optimize recovery speed and lower this number. Managing
>>>>>> human error is much harder.
>>>>>>
>>>>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org>
>>>>>> wrote:
>>>>>>
>>>>>>      Using percentages instead of numbers led me to calculation
>>>>>> errors. Here it is again using 1/100 instead of % for clarity ;-)
>>>>>>
>>>>>>      Assuming that:
>>>>>>
>>>>>>      * The pool is configured for three replicas (size = 3, which is
>>>>>>        the default)
>>>>>>      * It takes one hour for Ceph to recover from the loss of a
>>>>>>        single OSD
>>>>>>      * Any other disk has a 1/100,000 chance to fail within the hour
>>>>>>        following the failure of the first disk (assuming the AFR
>>>>>>        https://en.wikipedia.org/wiki/Annualized_failure_rate of every
>>>>>>        disk is 8%, divided by the number of hours in a year: 0.08 /
>>>>>>        8760 ~= 1/100,000, spelled out just below)
>>>>>>      * A given disk does not participate in more than 100 PGs
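>>>>>>
>>>>>>      (the AFR arithmetic referenced above:)
>>>>>>      ---
>>>>>>      $ echo "0.08 / 8760" | bc -l   # ~= 1/109,500, rounded to 1/100,000
>>>>>>      .00000913242009132420
>>>>>>      ---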
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre


