> Wow, that is impressive and sounds opposite of what we see around
> here. Often rebalances directly and strongly impact client I/O.

It might be the missing settings:

osd_op_queue = wpq
osd_op_queue_cut_off = high

If the cluster comes from kraken, these might be inherited with different values. Set these on "global"; it's more than just the OSDs that use these settings.

> I am, though, very confused why PGs would actually turn degraded, as
> the failure of the OSDs had already been corrected before.

This is expected behaviour. The OSDs are still present in the crush map and used for placement calculations. The idea is to reduce data movement between the other (healthy) OSDs during disk replacement. The procedure is to let the cluster heal and use osd destroy to maintain the IDs:

Monitor commands:
=================
osd destroy <osdname (id|osd.id)> {--yes-i-really-mean-it}   mark osd as being destroyed. Keeps the ID intact (allowing reuse), but removes cephx keys, config-key data and lockbox keys, rendering data permanently unreadable.

Then you can deploy new OSDs on the same hosts with these IDs and the data will move back with minimal movement between the other OSDs again. The manual deployment commands accept OSD IDs as an optional argument for this reason.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Nico Schottelius <nico.schottelius@xxxxxxxxxxx>
Sent: 12 August 2021 15:56:49
To: Frank Schilder
Cc: Nico Schottelius; Ceph Users
Subject: Re: Very slow I/O during rebalance - options to tune?

Hey Frank,

Frank Schilder <frans@xxxxxx> writes:

> The recovery_sleep options are the next choice to look at. Increase
> them and clients will get more I/O time slots. However, with your
> settings, I'm surprised clients are impacted at all. I usually leave
> the op-priority at its default and use osd-max-backfill=2..4 for
> HDDs. With this, clients usually don't notice anything. I'm running
> mimic 13.2.10 though.

Wow, that is impressive and sounds opposite of what we see around
here. Often rebalances directly and strongly impact client I/O.

I wonder if this is related to any inherited settings? This cluster
used to be kraken-based and we usually followed the ceph upgrade
guide, but maybe some tunables are incorrect and influence the client
I/O speed?

I'll note recovery_sleep for the next rebalance and see how it changes
the client I/O.

Something funky happened during this rebalance:

- the trigger was dead OSD removal: OSDs that had been out for days
- when triggering osd rm & crush remove, there were ~30 PGs marked as degraded
- it seems that after the degraded PGs had been fixed, the I/O went mostly back to normal

I am, though, very confused why PGs would actually turn degraded, as
the failure of the OSDs had already been corrected before.

Is this a bug or expected behaviour?

Cheers,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
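
For anyone following along, the settings and replacement procedure described above translate roughly into the commands below. This is only a sketch: the OSD ID 12 and the device path /dev/sdX are placeholders, the two queue settings only take effect after the OSDs are restarted, and exact flags can differ between Ceph releases.

# ceph.conf -- set on [global] so more than just the OSDs pick the values up
[global]
osd_op_queue = wpq
osd_op_queue_cut_off = high

# after the cluster has healed, retire the dead OSD but keep its ID
ceph osd destroy 12 --yes-i-really-mean-it

# redeploy on the replacement disk, reusing the same ID so data moves
# back with minimal shuffling between the other OSDs
ceph-volume lvm create --osd-id 12 --data /dev/sdX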