Don't discount failing drives. You can have drives in a "ready-to-fail" state that doesn't show up in SMART or anywhere else that's easy to track. When backfilling, a drive uses sectors it may not normally touch. I managed a 1400 OSD cluster that would lose 1-3 drives in random nodes whenever I added new storage, simply because of the large backfill that took place. We monitored dmesg, SMART, etc. for disk errors, but the failures would still happen suddenly during the large backfill. Several times the OSD didn't even have any SMART errors after it was dead.
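If you want to catch those marginal drives before a big backfill, this is a rough sketch of the kind of per-node check (not the exact script we ran; smartmontools assumed, device names are placeholders):

    smartctl -t long /dev/sdb      # kick off an extended self-test that touches the whole surface
    smartctl -a /dev/sdb | egrep -i 'reallocated|pending|uncorrect|self-test'
    dmesg -T | egrep -i 'blk_update_request|i/o error' | tail    # kernel-level errors SMART may never report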
It's easiest to track slow requests while they are happening. `ceph health detail` will report which OSD the request is blocked on and might shed some light. If a PG is peering for a while, you can also check which OSD it is stuck waiting on.
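For example, something along these lines usually narrows it down (the PG id is just a placeholder):

    ceph health detail             # lists the blocked requests and the osd they are stuck on
    ceph pg dump_stuck inactive    # PGs stuck peering or otherwise inactive
    ceph pg 3.1a7 query            # peering / recovery state of a specific PG, including which osds it is waiting on
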
On Fri, Sep 1, 2017, 12:09 PM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx> wrote:
Hello,
We have checked all the drives, and there is no problem with them. If there were a failing drive, I think the slow requests would also appear under normal traffic, since the ceph cluster uses all the OSDs as primaries for some PGs. But these slow requests appear only during the backfill. I will try to dig deeper into the IO operations at the next test.
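(For example, something along these lines on the OSD that ceph health detail points at; the osd id is just a placeholder:)

    ceph daemon osd.12 dump_ops_in_flight     # ops currently blocked and what they are waiting for
    ceph daemon osd.12 dump_historic_ops      # slowest recent ops with per-step timestamps
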
Kind regards,
Laszlo
On 01.09.2017 16:08, David Turner wrote:
> It is normal to have backfilling because the CRUSH map did change. The host and the chassis are both CRUSH buckets with their own weight, which is the sum of the OSDs under them. By moving the host into the chassis you changed the weight of the chassis, and that affects PG placement even though you didn't change the failure domain.
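>
> A quick way to see those bucket weights before and after the move (just a sketch):
>
>     ceph osd tree    # shows each chassis/host bucket and its weight, the sum of the osds beneath it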
>
> Osd_max_backfills = 1 shouldn't impact customer traffic and cause blocked requests. Most people find that they can use 3-5 before the disks are active enough to come close to impacting customer traffic. That would lead me to think you have a dying drive that you're reading from/writing to in sectors that are bad or at least slower.
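>
> If you do want to experiment with it, it can be changed on the fly (a rough sketch; the values are just examples):
>
>     # raise or lower the backfill limit at runtime without restarting OSDs
>     ceph tell osd.* injectargs '--osd-max-backfills 2'
>     # confirm what a given osd is actually running with (run on the node hosting osd.0)
>     ceph daemon osd.0 config get osd_max_backfills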
>
>
> On Fri, Sep 1, 2017, 6:13 AM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx <mailto:laszlo@xxxxxxxxxxxxxxxx>> wrote:
>
> Hi David,
>
> Well, most probably the larger part of our PGs will have to be reorganized, as we are moving from 9 hosts to 3 chassis. But I was hoping to be able to throttle the backfilling to an extent where it has minimal impact on our user traffic. Unfortunately I wasn't able to do it. I saw that the newer versions of ceph have the "osd recovery sleep" parameter. I think this would help, but unfortunately it's not present in hammer ... :(
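>
> (For reference, on a release that has it, it would be something along the lines of the following; the 0.1 second value is only an example:)
>
>     # not available in hammer; on newer releases this inserts a pause between recovery/backfill ops
>     ceph tell osd.* injectargs '--osd_recovery_sleep 0.1'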
>
> Also I have another question: is it normal to have backfill when we add a host to a chassis even if we don't change the CRUSH rule? Let me explain: we have the hosts directly assigned to the root bucket. Then we add chassis to the root, and then we move a host from the root to the chassis. The ruleset remains unchanged the whole time, with the host being the failure domain.
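>
> In command form, the move is roughly the following (bucket names are placeholders):
>
>     ceph osd crush add-bucket chassis1 chassis     # create the new chassis bucket
>     ceph osd crush move chassis1 root=default      # attach it under the root
>     ceph osd crush move host1 chassis=chassis1     # move an existing host into the chassis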
>
> Kind regards,
> Laszlo
>
>
> On 31.08.2017 17:56, David Turner wrote:
> > How long are you seeing these blocked requests for? Initially or perpetually? Changing the failure domain causes all PGs to peer at the same time; this would be the cause if it happens really quickly, and there is no way to avoid all of them peering while making a change like this. After that, it could easily be caused by the fair majority of your data that is probably set to move around. I would check what might be causing the blocked requests during this time. See if there is an OSD that might be dying (large backfills have a tendency to find a couple of failing drives), which could easily cause things to block. Also, checking with iostat whether your disks or journals are maxed out could shed some light on any contributing factor.
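> >
> > Something as simple as the following (assuming sysstat is installed) is usually enough to spot a saturated disk or journal:
> >
> >     iostat -xmt 5    # watch %util, await and avgqu-sz per device while the backfill is running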
> >
> > On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai <laszlo@xxxxxxxxxxxxxxxx <mailto:laszlo@xxxxxxxxxxxxxxxx>> wrote:
> >
> > Dear all!
> >
> > In our Hammer cluster we are planning to switch our failure domain from host to chassis. We have performed some simulations, and regardless of the settings we have used, slow requests have appeared every time.
> >
> > we had the following settings:
> >
> > "osd_max_backfills": "1",
> > "osd_backfill_full_ratio": "0.85",
> > "osd_backfill_retry_interval": "10",
> > "osd_backfill_scan_min": "1",
> > "osd_backfill_scan_max": "4",
> > "osd_kill_backfill_at": "0",
> > "osd_debug_skip_full_check_in_backfill_reservation": "false",
> > "osd_debug_reject_backfill_probability": "0",
> >
> > "osd_min_recovery_priority": "0",
> > "osd_allow_recovery_below_min_size": "true",
> > "osd_recovery_threads": "1",
> > "osd_recovery_thread_timeout": "60",
> > "osd_recovery_thread_suicide_timeout": "300",
> > "osd_recovery_delay_start": "0",
> > "osd_recovery_max_active": "1",
> > "osd_recovery_max_single_start": "1",
> > "osd_recovery_max_chunk": "8388608",
> > "osd_recovery_forget_lost_objects": "false",
> > "osd_recovery_op_priority": "1",
> > "osd_recovery_op_warn_multiple": "16",
> >
> >
> > we have also tested it with the CFQ IO scheduler on the OSDs and the following params:
> > "osd_disk_thread_ioprio_priority": "7"
> > "osd_disk_thread_ioprio_class": "idle"
> >
> > and the nodeep-scrub flag set.
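> >
> > (Roughly, these were applied along the following lines; sdb is a placeholder for each OSD data disk:)
> >
> >     echo cfq > /sys/block/sdb/queue/scheduler    # CFQ is required for the ioprio settings to take effect
> >     ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 7'
> >     ceph osd set nodeep-scrub                    # keep deep scrubs from competing with the backfill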
> >
> > Is there anything else to try? Is there a good way to switch from one kind of failure domain to another without slow requests?
> >
> > Thank you in advance for any suggestions.
> >
> > Kind regards,
> > Laszlo
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com