Best practice K/M-parameters EC pool

On 8/28/2014 11:17 AM, Loic Dachary wrote:
>
>
> On 28/08/2014 16:29, Mike Dawson wrote:
>> On 8/28/2014 12:23 AM, Christian Balzer wrote:
>>> On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
>>>
>>>>
>>>>
>>>> On 27/08/2014 04:34, Christian Balzer wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
>>>>>
>>>>>> Hi Craig,
>>>>>>
>>>>>> I assume the reason for the 48 hours recovery time is to keep the cost
>>>>>> of the cluster low ? I wrote "1h recovery time" because it is roughly
>>>>>> the time it would take to move 4TB over a 10Gb/s link. Could you
>>>>>> upgrade your hardware to reduce the recovery time to less than two
>>>>>> hours ? Or are there factors other than cost that prevent this ?
>>>>>>
>>>>>
>>>>> I doubt Craig is operating on a shoestring budget.
>>>>> And even if his network were to be just GbE, that would still make it
>>>>> only 10 hours according to your wishful thinking formula.
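
The wire-speed arithmetic behind those figures, as a rough sketch (it counts
only network throughput and ignores replication fan-out and disk IO entirely):

---
# back-of-the-envelope time to re-create 4TB, network only
data_bytes = 4e12
for name, gbit in [("10GbE", 10), ("10GbE, 50% busy", 5), ("1GbE", 1)]:
    seconds = data_bytes * 8 / (gbit * 1e9)
    print("%-16s ~%.1f hours" % (name, seconds / 3600))
# 10GbE            ~0.9 hours
# 10GbE, 50% busy  ~1.8 hours
# 1GbE             ~8.9 hours
---
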
>>>>>
>>>>> He probably has set the max_backfills to 1 because that is the level of
>>>>> I/O his OSDs can handle w/o degrading cluster performance too much.
>>>>> The network is unlikely to be the limiting factor.
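
For reference, that throttle is the osd max backfills setting; a minimal
sketch of how it is typically set (values illustrative):

---
# ceph.conf, [osd] section
osd max backfills = 1

# or injected at runtime, without restarting the daemons
ceph tell osd.* injectargs '--osd-max-backfills 1'
---
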
>>>>>
>>>>> The way I see it, most Ceph clusters are in a sort of steady state when
>>>>> operating normally: a few hundred VM RBD images ticking over, most
>>>>> actual OSD disk ops being writes, since nearly all hot objects that are
>>>>> being read sit in the page cache of the storage nodes.
>>>>> Easy peasy.
>>>>>
>>>>> Until something happens that breaks this routine, like a deep scrub,
>>>>> all those VMs rebooting at the same time or a backfill caused by a
>>>>> failed OSD. Now all of a sudden client ops compete with the backfill
>>>>> ops, page caches are no longer hot, the spinners are seeking left and
>>>>> right. Pandemonium.
>>>>>
>>>>> I doubt very much that even with an SSD-backed cluster you would get
>>>>> away with less than 2 hours for 4TB.
>>>>>
>>>>> To give you some real life numbers, I currently am building a new
>>>>> cluster but for the time being have only one storage node to play with.
>>>>> It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
>>>>> actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
>>>>>
>>>>> So I took out one OSD (reweight 0 first, then the usual removal steps)
>>>>> because the actual disk was wonky. Replaced the disk and re-added the
>>>>> OSD. Both operations took about the same time: 4 minutes for
>>>>> evacuating the OSD (having 7 write targets clearly helped) for a
>>>>> measly 12GB, or about 50MB/s, and 5 minutes, or about 35MB/s, for
>>>>> refilling the OSD. And that is on one node (thus no network latency)
>>>>> with the default parameters (so a max_backfills of 10) which was
>>>>> otherwise totally idle.
>>>>>
>>>>> In other words, in this pretty ideal case it would have taken 22 hours
>>>>> to re-distribute 4TB.
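
Extrapolating those observed single-node rates to a full 4TB drive (a rough
sketch, nothing more):

---
# scale the measured rates up to 4TB (4e6 MB)
for label, rate_mb_s in [("evacuate @ ~50MB/s", 50.0), ("refill   @ ~35MB/s", 35.0)]:
    print("%-20s ~%.0f hours" % (label, 4e6 / rate_mb_s / 3600))
# evacuate @ ~50MB/s   ~22 hours
# refill   @ ~35MB/s   ~32 hours
---
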
>>>>
>>>> That makes sense to me :-)
>>>>
>>>> When I wrote 1h, I thought about what happens when an OSD becomes
>>>> unavailable with no planning in advance. In the scenario you describe,
>>>> the risk of data loss does not increase, since the objects are evicted
>>>> gradually from the disk being decommissioned and the number of replicas
>>>> stays the same at all times. There is no sudden drop in the number of
>>>> replicas, which is what I had in mind.
>>>>
>>> That may be, but I'm rather certain that there is no difference in speed
>>> and priority of a rebalancing caused by an OSD set to weight 0 or one
>>> being set out.
>>>
>>>> If the lost OSD was part of 100 PGs, the other disks (let's say 50 of
>>>> them) will start transferring a new replica of the objects they have to
>>>> the new OSD in their PG. The replacement will not be a single OSD,
>>>> although nothing prevents the same OSD from being used in more than one
>>>> PG as a replacement for the lost one. If the cluster network is connected at
>>>> 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
>>>> duplicates do not originate from a single OSD but from at least dozens
>>>> of them and since they target more than one OSD, I assume we can expect
>>>> an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
>>>> account for the fact that the cluster network is never idle.
>>>>
>>>> Am I being too optimistic ?
>>> Vastly.
>>>
>>>> Do you see another blocking factor that
>>>> would significantly slow down recovery ?
>>>>
>>> As Craig and I keep telling you, the network is not the limiting factor.
>>> Concurrent disk IO is, as I pointed out in the other thread.
>>
>> Completely agree.
>>
>> On a production cluster with OSDs backed by spindles, even with OSD journals on SSDs, it is insufficient to calculate single-disk replacement backfill time based solely on network throughput. IOPS will likely be the limiting factor when backfilling a single failed spinner in a production cluster.
>>
>> Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs.
>>
>> Using only the throughput math, backfill could theoretically have completed in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times with similar results.
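
The throughput-only estimate mentioned above works out roughly like this (a
sketch that assumes ~2.25TB to backfill and an ideal 2Gb/s from the bonded
pair):

---
data_mb   = 0.75 * 3e6          # ~75% of a 3TB drive, in MB
link_mb_s = 2 * 125.0           # dual bonded 1GbE at wire speed
print("throughput-only estimate: ~%.1f hours" % (data_mb / link_mb_s / 3600))
# throughput-only estimate: ~2.5 hours, versus the ~15 hours observed
---
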
>>
>> Why? Spindle contention on the replacement drive. Graph the '%util' metric from something like 'iostat -xt 2' during a single disk backfill to get a very clear view that spindle contention is the true limiting factor. It'll be pegged at or near 100% if spindle contention is the issue.
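
One way to capture that for graphing (a sketch; sdd is only an example device
name):

---
# log extended per-device stats every 2s for the duration of the backfill
iostat -xt 2 > /var/tmp/backfill-iostat.log &
# afterwards, pull out the replacement spindle's %util (last column of -x output)
grep '^sdd ' /var/tmp/backfill-iostat.log | awk '{print $NF}'
---
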
>
> Hi Mike,
>
> Did you by any chance also measure how long it took for the 3 replicas to be restored on all PGs in which the failed disk was participating? I assume the following sequence happened:
>
> A) The 3TB drive failed and contained ~2TB
> B) The cluster recovered by creating new replicas
> C) The new 3TB drive was installed
> D) Backfilling completed
>
> I'm interested in the time between A and B, i.e. when one copy is potentially lost forever, because this is when the probability of permanent data loss increases. Although it is important to reduce the time between C and D to a minimum, it has no impact on the durability of the data.
>

Loic,

We use 3x replication and have drives that see relatively high 
steady-state IOPS. Therefore, we tend to prioritize client-side IO over 
quickly repairing the drop from 3 copies to 2 during the loss of one 
disk. The disruption to client IO is so great on our cluster that we 
don't want the cluster to be in a recovery state without operator 
supervision.

Letting OSDs get marked out without operator intervention was a disaster 
in the early going of our cluster. For example, an OSD daemon crash 
would trigger automatic recovery where it was unneeded. Ironically, the 
unneeded recovery would often trigger additional daemons to crash, 
making a bad situation worse. During recovery, RBD client IO would 
oftentimes drop to 0.

To deal with this issue, we set "mon osd down out interval = 14400", so 
as operators we have 4 hours to intervene before Ceph attempts to 
self-heal. When hardware is at fault, we remove the osd, replace the 
drive, re-add the osd, then allow backfill to begin, thereby completely 
skipping step B in your timeline above.
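
For anyone wanting to copy this setup, the relevant pieces look roughly like 
the sketch below; osd.42 is a placeholder and the exact replacement steps 
vary by deployment:

---
# ceph.conf, [mon] section -- 4 hour grace period before auto-marking out
mon osd down out interval = 14400

# typical manual replacement once hardware is confirmed bad
ceph osd out 42
ceph osd crush remove osd.42
ceph auth del osd.42
ceph osd rm 42
# ...swap the drive, recreate the OSD, then let backfill run
---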

- Mike

> Cheers
>
>> - Mike
>>
>>
>>>
>>> Another example if you please:
>>> My shitty test cluster, 4 nodes, one OSD each, journal on disk, no SSDs.
>>> 1 GbE links for client and cluster respectively.
>>> ---
>>> #ceph -s
>>>       cluster 25bb48ec-689d-4cec-8494-d1a62ca509be
>>>        health HEALTH_OK
>>>        monmap e1: 1 mons at {irt03=192.168.0.33:6789/0}, election epoch 1, quorum 0 irt03
>>>        osdmap e1206: 4 osds: 4 up, 4 in
>>>         pgmap v543045: 256 pgs, 3 pools, 62140 MB data, 15648 objects
>>>               141 GB used, 2323 GB / 2464 GB avail
>>>                    256 active+clean
>>> ---
>>> replication size is 2; it can do about 60MB/s writes with rados bench from
>>> a client.
>>>
>>> Setting one OSD out (the data distribution is nearly uniform) it took 12
>>> minutes to recover on a completely idle (no clients connected) cluster.
>>> The disk utilization was 70-90%, the cluster network hovered around 20%,
>>> never exceeding 35% on the 3 "surviving" nodes. CPU was never an issue.
>>> Given the ceph log numbers and the data size, I make this a recovery speed
>>> of about 40MB/s or 13MB/s per OSD.
>>> Better than I expected, but a far cry from what the OSDs could do
>>> individually if they were not flooded with concurrent read and write
>>> requests by the backfilling operation.
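
Reconstructing that rate from the pgmap above (a rough sketch that assumes
the out OSD held an even quarter of the replicated data):

---
data_mb = 62140 * 2 / 4.0      # ~31 GB to re-create across the 3 survivors
seconds = 12 * 60
print("aggregate: ~%.0f MB/s" % (data_mb / seconds))       # ~43 MB/s
print("per OSD:   ~%.0f MB/s" % (data_mb / seconds / 3))   # ~14 MB/s
---
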
>>>
>>> Now, more disks will help, but I very much doubt that this will scale
>>> linearly, so 50 OSDs won't give you 500MB/s (somebody prove me wrong please).
>>>
>>> And this was an IDLE cluster.
>>>
>>> Doing this on a cluster with just about 10 client IOPS per OSD would be
>>> far worse. Never mind that people don't like their client IO to stall for
>>> more than a few seconds.
>>>
>>> Something that might improve this both in terms of speed and impact on
>>> the clients would be something akin to the MD (Linux software RAID)
>>> recovery logic.
>>> As in, only one backfill operation per OSD (read or write, not both!) at
>>> the same time.
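
As far as I know no existing option enforces exactly that MD-style
exclusivity; the closest throttles available today are sketched below
(values illustrative):

---
# ceph.conf, [osd] section -- closest existing knobs, not quite MD's logic
osd max backfills = 1            # limit concurrent backfills per OSD (default 10)
osd recovery max active = 1      # limit in-flight recovery ops per OSD
osd recovery op priority = 1     # weight recovery ops below client ops
---
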
>>>
>>> Regards,
>>>
>>> Christian
>>>> Cheers
>>>>
>>>>> More in another reply.
>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 26/08/2014 19:37, Craig Lewis wrote:
>>>>>>> My OSD rebuild time is more like 48 hours (4TB disks, >60% full, osd
>>>>>>> max backfills = 1).   I believe that increases my risk of failure by
>>>>>>> 48^2 .  Since your numbers are failure rate per hour per disk, I need
>>>>>>> to consider the risk for the whole time for each disk.  So more
>>>>>>> formally, rebuild time to the power of (replicas -1).
>>>>>>>
>>>>>>> So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a
>>>>>>> much higher risk than 1 / 10^8.
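
The same arithmetic, spelled out as a quick sketch:

---
base_risk = 1.0 / 1e8          # the earlier estimate for a 1 hour recovery window
rebuild_h = 48
replicas  = 3
risk = base_risk * rebuild_h ** (replicas - 1)
print("1 in %.0f" % (1 / risk))   # ~1 in 43,000
---
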
>>>>>>>
>>>>>>>
>>>>>>> A risk of 1/43,000 means that I'm more likely to lose data due to
>>>>>>> human error than disk failure.  Still, I can put a small bit of
>>>>>>> effort in to optimize recovery speed, and lower this number.
>>>>>>> Managing human error is much harder.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary <loic at dachary.org> wrote:
>>>>>>>
>>>>>>>       Using percentages instead of numbers led me to calculation
>>>>>>> errors. Here it is again using 1/100 instead of % for clarity ;-)
>>>>>>>
>>>>>>>       Assuming that:
>>>>>>>
>>>>>>>       * The pool is configured for three replicas (size = 3 which is
>>>>>>> the default)
>>>>>>>       * It takes one hour for Ceph to recover from the loss of a single
>>>>>>> OSD
>>>>>>>       * Any other disk has a 1/100,000 chance to fail within the hour
>>>>>>> following the failure of the first disk (assuming the AFR
>>>>>>> https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk
>>>>>>> is 8%; divided by the number of hours in a year, (0.08 / 8760)
>>>>>>> ~= 1/100,000; see the sketch after this list)
>>>>>>>       * A given disk does not participate in more than 100 PG
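
A quick sketch of where that per-hour figure comes from:

---
afr = 0.08                       # 8% annualized failure rate
per_hour = afr / (365 * 24)      # 8760 hours in a year
print("1 in %.0f" % (1 / per_hour))   # ~1 in 109,500, rounded to 1/100,000 above
---
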
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

