I am fighting an issue on an 18.2.0 cluster where a restart of an OSD that backs the RGW index pool causes crippling slow ops. If the OSD is marked with a primary-affinity of 0 prior to the restart, no slow ops are observed; if it has a primary-affinity of 1, slow ops occur. The slow ops only occur during the recovery period of the OMAP data, and further only occur when client activity is allowed to reach the cluster. Luckily I am able to test this during periods when I can disable all client activity at the upstream proxy.

Given that the primary-affinity change prevents the slow ops, I think this may be a case of recovery being more detrimental than backfill. I am thinking that forcing backfill, so that a pg_temp acting set is installed [1], may be the right way to mitigate the issue. I believe that reducing the PG log entries for these OSDs would accomplish that, but tuning osd_async_recovery_min_cost [2] may accomplish something similar. I am not sure of the appropriate value for that config at this point, or whether there is a better approach, and am seeking any input here.

Further, if this issue sounds familiar, or sounds like some other condition within the OSD may be at hand, I would be interested in hearing your input or thoughts. Thanks!

[1] https://docs.ceph.com/en/latest/dev/peering/#concepts
[2] https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx
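
P.S. For concreteness, the restart procedure that avoids the slow ops looks roughly like the sketch below. osd.12 is just a placeholder for whichever index-pool OSD is being restarted, and the systemd unit name will vary by deployment:

    # Drop primary affinity so no index-pool PG keeps this OSD as its
    # primary while the OMAP data recovers.
    ceph osd primary-affinity osd.12 0

    # Restart the daemon (invocation depends on the deployment).
    systemctl restart ceph-osd@12

    # Once recovery completes, restore the affinity.
    ceph osd primary-affinity osd.12 1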
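
The tuning I am weighing would be along these lines; the values are guesses to illustrate the idea, not tested recommendations:

    # Shrink the PG log on the index-pool OSDs so a restarted OSD falls
    # behind its peers' logs and must backfill (installing a pg_temp
    # acting set) instead of doing log-based recovery.
    ceph config set osd.12 osd_min_pg_log_entries 500
    ceph config set osd.12 osd_max_pg_log_entries 500

    # Or lower the async recovery threshold so a recovering OSD is moved
    # out of the acting set sooner (the default cost is 100).
    ceph config set osd osd_async_recovery_min_cost 10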