Best practice K/M-parameters EC pool

On 8/28/2014 12:23 AM, Christian Balzer wrote:
> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
>
>>
>>
>> On 27/08/2014 04:34, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
>>>
>>>> Hi Craig,
>>>>
>>>> I assume the reason for the 48 hours recovery time is to keep the
>>>> cost of the cluster low? I wrote "1h recovery time" because it is
>>>> roughly the time it would take to move 4TB over a 10Gb/s link. Could
>>>> you upgrade your hardware to reduce the recovery time to less than
>>>> two hours? Or are there factors other than cost that prevent this?
>>>>
>>>
>>> I doubt Craig is operating on a shoestring budget.
>>> And even if his network were to be just GbE, that would still make it
>>> only 10 hours according to your wishful thinking formula.
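
(As an aside: the raw-throughput figure both of you are trading comes
down to a one-liner. A sketch, assuming a fully idle link and no disk
contention, which is exactly what the rest of this thread disputes:)
---
# Idealized rebuild time: data volume over raw link speed, nothing else.
def transfer_hours(data_tb, link_gbps):
    return data_tb * 1e12 * 8 / (link_gbps * 1e9) / 3600.0

print(transfer_hours(4, 10))  # ~0.9h -> the "1h recovery time"
print(transfer_hours(4, 1))   # ~8.9h -> "about 10 hours" on GbE
---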
>>>
>>> He probably has set the max_backfills to 1 because that is the level of
>>> I/O his OSDs can handle w/o degrading cluster performance too much.
>>> The network is unlikely to be the limiting factor.
>>>
>>> The way I see it, most Ceph clusters are in a sort of steady state
>>> when operating normally: a few hundred VM RBD images ticking over, most
>>> actual OSD disk ops are writes, as nearly all hot objects that are
>>> being read are in the page cache of the storage nodes.
>>> Easy peasy.
>>>
>>> Until something happens that breaks this routine, like a deep scrub,
>>> all those VMs rebooting at the same time or a backfill caused by a
>>> failed OSD. Now all of a sudden client ops compete with the backfill
>>> ops, page caches are no longer hot, the spinners are seeking left and
>>> right. Pandemonium.
>>>
>>> I doubt very much that even with an SSD-backed cluster you would get
>>> away with less than 2 hours for 4TB.
>>>
>>> To give you some real life numbers, I currently am building a new
>>> cluster but for the time being have only one storage node to play with.
>>> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8
>>> actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
>>>
>>> So I took out one OSD (reweight 0 first, then the usual removal steps)
>>> because the actual disk was wonky. Replaced the disk and re-added the
>>> OSD. Both operations took about the same time: 4 minutes for
>>> evacuating the OSD (having 7 write targets clearly helped) for a
>>> measly 12GB, or about 50MB/s, and 5 minutes, or about 35MB/s, for
>>> refilling the OSD. And that is on one node (thus no network latency)
>>> with the default parameters (so a max_backfill of 10) which was
>>> otherwise totally idle.
>>>
>>> In other words, in this pretty ideal case it would have taken 22 hours
>>> to re-distribute 4TB.
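
(Extrapolating those measured rates to a 4TB drive is simple division --
a sketch using the 50MB/s evacuate and 35MB/s refill figures above, same
caveats: a single idle node, default max_backfills:)
---
# Time to move 4TB at the observed single-node backfill rates.
for rate_mb in (50, 35):
    print(4e12 / (rate_mb * 1e6) / 3600.0)  # ~22h evacuating, ~32h refilling
---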
>>
>> That makes sense to me :-)
>>
>> When I wrote 1h, I thought about what happens when an OSD becomes
>> unavailable with no planning in advance. In the scenario you describe,
>> the risk of data loss does not increase, since the objects are evicted
>> gradually from the disk being decommissioned and the number of replicas
>> stays the same at all times. There is no sudden drop in the number of
>> replicas, which is what I had in mind.
>>
> That may be, but I'm rather certain that there is no difference in speed
> and priority between a rebalancing caused by an OSD set to weight 0 and
> one caused by an OSD being set out.
>
>> If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
>> them) will start transferring a new replica of the objects they hold
>> to the new OSD in their PG. The replacement will not be a single OSD,
>> although nothing prevents the same OSD from being used in more than
>> one PG as a replacement for the lost one. If the cluster network is
>> connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s.
>> Since the new duplicates do not originate from a single OSD but from
>> at least dozens of them, and since they target more than one OSD, I
>> assume we can expect an actual throughput of 5Gb/s. I should have
>> written 2h instead of 1h to account for the fact that the cluster
>> network is never idle.
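
(For reference, the arithmetic behind that estimate; the optimistic part
is assuming the recovery fan-out actually saturates the spare capacity:)
---
# 10Gb/s cluster network with 50% consumed by background traffic.
usable_bps = 0.5 * 10e9
print(4e12 * 8 / usable_bps / 3600.0)  # ~1.8h -> the "2h" estimate
---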
>>
>> Am I being too optimistic?
> Vastly.
>
>> Do you see another blocking factor that
>> would significantly slow down recovery?
>>
> As Craig and I keep telling you, the network is not the limiting factor.
> Concurrent disk IO is, as I pointed out in the other thread.

Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD 
journals on SSDs, it is not enough to estimate single-disk replacement 
backfill time from network throughput alone. IOPS will likely be the 
limiting factor when backfilling a single failed spinner.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-OSD 
cluster: 24 hosts, an rbd pool with 3 replicas, OSD journals on SSDs 
(3:1 ratio), and dual 1GbE bonded NICs.

Using only the throughput math, the backfill could theoretically have 
completed in a bit over 2.5 hours, but it actually took 15 hours. I've 
done this a few times with similar results.
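
To make that gap concrete, here is the back-of-the-envelope version, 
assuming the full ~2.25TB that was on the dead drive had to land on the 
replacement:
---
# Naive network-only estimate vs. the observed 15 hours.
data = 0.75 * 3e12               # ~2.25TB to backfill
bonded_bps = 2 * 1e9             # dual 1GbE bonded
print(data * 8 / bonded_bps / 3600.0)  # ~2.5h, the throughput-only estimate
print(data / (15 * 3600) / 1e6)        # ~42MB/s effective -- IOPS-bound
---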

Why? Spindle contention on the replacement drive. Graph the '%util' 
metric from something like 'iostat -xt 2' during a single-disk backfill: 
if spindle contention is the true limiting factor, the drive will be 
pegged at or near 100%.
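
If iostat isn't available, the same '%util' number can be derived from 
/proc/diskstats. A minimal Python sketch (device name as the argument; 
the 2-second interval matches the iostat invocation above):
---
import sys, time

def io_ticks(dev):
    # Field 10 after the device name in /proc/diskstats is the number of
    # milliseconds the device has spent doing I/O.
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12])
    raise ValueError('device %s not found' % dev)

dev, interval = sys.argv[1], 2.0
prev = io_ticks(dev)
while True:
    time.sleep(interval)
    cur = io_ticks(dev)
    # Busy time over wall time; at or near 100% the spindle is saturated.
    print('%s %%util: %.1f' % (dev, 100.0 * (cur - prev) / (interval * 1000)))
    prev = cur
---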

- Mike


>
> Another example if you please:
> My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs.
> 1 GbE links for client and cluster respectively.
> ---
> #ceph -s
>      cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
>       health HEALTH_OK
>       monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
>       osdmap e1206: 4 osds: 4 up, 4 in
>        pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
>              141 GB used, 2323 GB / 2464 GB avail
>                   256 active+clean
> ---
> Replication size is 2; it can do about 60MB/s of writes with rados
> bench from a client.
>
> Setting one OSD out (the data distribution is nearly uniform), it took 12
> minutes to recover on a completely idle (no clients connected) cluster.
> The disk utilization was 70-90%, the cluster network hovered around 20%,
> never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
> Given the ceph log numbers and the data size, I make this a recovery speed
> of about 40MB/s or 13MB/s per OSD.
> Better than I expected, but a far cry from what the OSDs could do
> individually if they were not flooded with concurrent read and write
> requests by the backfilling operation.
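
(To put those numbers in context, a quick sketch -- the ~120MB/s
sequential figure is an assumption for a 7200RPM spinner, not a
measurement from this cluster:)
---
# Per-OSD recovery rate vs. assumed sequential bandwidth of one disk.
aggregate = 40e6             # observed recovery speed on the idle cluster
per_osd = aggregate / 3      # three surviving OSDs share the work
print(per_osd / 1e6)         # ~13MB/s
print(per_osd / 120e6)       # ~11% of what the disk could stream
---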
>
> Now, more disks will help, but I very much doubt that this will scale
> linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong
> please).
>
> And this was an IDLE cluster.
>
> Doing this on a cluster with just 10 client IOPS per OSD would be far
> worse. Never mind that people don't like their client IO to stall for
> more than a few seconds.
>
> Something that might improve this both in terms of speed and impact on
> the clients would be something akin to the MD (Linux software RAID)
> recovery logic.
> As in, only one backfill operation per OSD (read or write, not both!) at
> the same time.
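
(The closest existing knobs to that, as far as I know, are 'osd max
backfills' and 'osd recovery max active', which can be injected at
runtime with something like 'ceph tell osd.* injectargs
"--osd-max-backfills 1 --osd-recovery-max-active 1"'. But even at 1
each, a given OSD can still be reading for one PG while writing for
another, which is exactly the contention described above.)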
>
> Regards,
>
> Christian
>> Cheers
>>
>>> More in another reply.
>>>
>>>> Cheers
>>>>
>>>> On 26/08/2014 19:37, Craig Lewis wrote:
>>>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full,
>>>>> osd max backfills = 1). I believe that increases my risk of failure
>>>>> by 48^2. Since your numbers are a failure rate per hour per disk, I
>>>>> need to consider the risk over the whole rebuild time for each disk.
>>>>> So more formally: rebuild time to the power of (replicas - 1).
>>>>>
>>>>> So I'm at 2304/100,000,000, or approximately 1/43,000. That's a
>>>>> much higher risk than 1 / 10^8.
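
(Spelling that scaling argument out, with base_risk being the ~1/10^8
figure quoted for a 1h recovery at size=3:)
---
# Each additional replica that must fail multiplies in the rebuild window.
base_risk = 1e-8         # probability of data loss with a 1h rebuild
rebuild_h = 48
print(base_risk * rebuild_h ** 2)  # ~2.3e-5, i.e. roughly 1/43,000
---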
>>>>>
>>>>>
>>>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>>>> human error than disk failure. Still, I can put in a small bit of
>>>>> effort to optimize recovery speed and lower this number. Managing
>>>>> human error is much harder.
>>>>>
>>>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org>
>>>>> wrote:
>>>>>
>>>>>      Using percentages instead of numbers led me to calculation
>>>>> errors. Here it is again using 1/100 instead of % for clarity ;-)
>>>>>
>>>>>      Assuming that:
>>>>>
>>>>>      * The pool is configured for three replicas (size = 3, which
>>>>> is the default)
>>>>>      * It takes one hour for Ceph to recover from the loss of a
>>>>> single OSD
>>>>>      * Any other disk has a 1/100,000 chance to fail within the
>>>>> hour following the failure of the first disk (assuming the AFR
>>>>> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk
>>>>> is 8%, divided by the number of hours in a year: 0.08 / 8760 ~=
>>>>> 1/100,000)
>>>>>      * A given disk does not participate in more than 100 PGs
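
(Checking that conversion -- spreading an 8% AFR uniformly over a year's
hours, the uniformity itself being a simplification:)
---
afr = 0.08
print(afr / (24 * 365))  # ~9.1e-6, i.e. roughly 1/100,000 per hour
---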
>>>>>
>>>>
>>>
>>>
>>
>
>

