Hi Martin,

In Quincy, osd_op_queue defaults to 'mclock_scheduler'. Before Quincy it was set to 'wpq'.

> On a 3-node hyper-converged PVE cluster with 12 SSD OSD devices, I
> experience stalls in RBD performance during normal backfill
> operations, e.g. moving a pool from 2/1 to 3/2.
>
> I was expecting that I could control the load caused by the backfilling
> using
>
>   ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
> or
>   ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'
>
> Even
>
>   ceph tell 'osd.*' config set osd_recovery_sleep_ssd 2.1
>
> did not help.
>
> Any hints?

Due to the way the mclock scheduler works, the sleep options, along with the backfill and recovery limits, cannot be modified. This is documented here:

https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#mclock-built-in-profiles

> I am running Ceph Quincy 17.2.5 on a test system with a dedicated
> 1 Gbit/9000 MTU storage network, while the public Ceph network
> (1 Gbit/1500 MTU) is shared with the VM network.
>
> I am looking forward to your suggestions.

The following optimizations, which change the backfill/recovery behavior you are observing, are slated to be merged:

1. Reduce the currently high limit set for backfill/recovery operations, which can overwhelm client operations in some situations.
2. Allow users to modify the backfill/recovery limits, if required, via another gating option.
3. Optimize the mclock profiles so that client and recovery operations get the desired IOPS allocations.

Until the next Quincy release, you can avoid the backfill/recovery issue by switching to the 'wpq' scheduler: set osd_op_queue = wpq and restart the OSDs.

-Sridhar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
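[For reference, the wpq workaround described above can be sketched as the following command sequence. This is a sketch, not part of the original post; it assumes a systemd-based deployment (as on Proxmox VE) and uses osd.0 as an example daemon — adjust IDs and the restart command to your environment, and restart nodes one at a time to stay available:]

```shell
# Check which scheduler the OSDs are currently using
# (a default Quincy cluster reports 'mclock_scheduler'):
ceph config show osd.0 osd_op_queue

# Switch all OSDs to the wpq scheduler. osd_op_queue is only read at
# startup, so the OSDs must be restarted for this to take effect:
ceph config set osd osd_op_queue wpq

# Restart the OSDs on each node in turn; on a Proxmox VE / packaged
# systemd install this restarts all OSDs on the local node:
systemctl restart ceph-osd.target

# Confirm a running daemon picked up the new scheduler:
ceph daemon osd.0 config get osd_op_queue

# With wpq active, the usual recovery throttles apply again, e.g.:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
```
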