Hey Frank,

Frank Schilder <frans@xxxxxx> writes:

> The recovery_sleep options are the next choice to look at. Increase it
> and clients will get more I/O time slots. However, with your settings,
> I'm surprised clients are impacted at all. I usually leave the
> op-priority at its default and use osd-max-backfill=2..4 for HDDs.
> With this, clients usually don't notice anything. I'm running mimic
> 13.2.10 though.

Wow, that is impressive and the opposite of what we see around here:
rebalances regularly have a direct and strong impact on client I/O.

I wonder whether this is related to inherited settings. This cluster
started out on kraken and we usually followed the Ceph upgrade guides,
but maybe some tunables are incorrect and are influencing client I/O
speed?

I'll note recovery_sleep before the next rebalance and see how it
changes the client I/O.

Something funky happened during this rebalance:

- the trigger was the removal of dead OSDs that had been out for days
- when triggering osd rm & crush remove, ~30 PGs were marked as degraded
- after the degraded PGs had recovered, client I/O seemed mostly back
  to normal

I am, however, quite confused as to why PGs would turn degraded at all,
since the failure of those OSDs had already been repaired beforehand.
Is this a bug or expected behaviour?

Cheers,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
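
For reference, a minimal sketch of how the throttles discussed above can
be adjusted at runtime, assuming a luminous-or-newer cluster where
injectargs is available (the actual option name is osd_max_backfills;
the values below are illustrative, not recommendations):

    # Slow recovery down so clients get more I/O time slots:
    ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0.2'

    # Limit concurrent backfills per OSD:
    ceph tell osd.* injectargs '--osd_max_backfills 2'

    # Verify what an OSD is actually running with
    # (uses the admin socket, so run on the host where osd.0 lives):
    ceph daemon osd.0 config show | grep -e recovery_sleep -e max_backfills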
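
To rule out CRUSH tunables inherited from the kraken days, the profile
currently in effect can be inspected without changing anything:

    # Show the CRUSH tunables the cluster is currently using:
    ceph osd crush show-tunables

    # Note: actually switching profiles, e.g.
    #   ceph osd crush tunables optimal
    # can trigger a large rebalance of its own, so inspect before changing.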
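
And a sketch for watching PG states during the next dead-OSD removal
(osd.NN is a placeholder; on luminous or newer, ceph osd purge bundles
crush remove, auth del and osd rm into one step):

    # Remove a dead OSD in one step:
    ceph osd purge osd.NN --yes-i-really-mean-it

    # Watch which PGs go degraded and why:
    ceph health detail | grep -i degraded
    ceph pg dump_stuck degraded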