We currently have in src/common/options/global.yaml.in:

- name: osd_async_recovery_min_cost
  type: uint
  level: advanced
  desc: A mixture measure of number of current log entries difference and
    historical missing objects, above which we switch to use asynchronous
    recovery when appropriate
  default: 100
  flags:
  - runtime

I'd like to rephrase that description in a PR. Might you be able to share
your insight into the dynamics so I can craft a better description? And do
you have any thoughts on the default value? Might appropriate values vary by
pool type and/or media?

(For context, I've sketched the runtime commands being discussed below the
quoted thread.)

> On Apr 3, 2024, at 13:38, Joshua Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>
> We've had success using osd_async_recovery_min_cost=0 to drastically
> reduce slow ops during index recovery.
>
> Josh
>
> On Wed, Apr 3, 2024 at 11:29 AM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx> wrote:
>>
>> I am fighting an issue on an 18.2.0 cluster where a restart of an OSD
>> that serves the RGW index pool causes crippling slow ops. If the OSD is
>> marked with a primary affinity of 0 prior to the restart, no slow ops
>> are observed; if the OSD has a primary affinity of 1, slow ops occur.
>> The slow ops occur only during the recovery period of the omap data,
>> and, further, only when client activity is allowed to reach the
>> cluster. Luckily I am able to test this during periods when I can
>> disable all client activity at the upstream proxy.
>>
>> Given that the primary-affinity change prevents the slow ops, I think
>> this may be a case of recovery being more detrimental than backfill. I
>> am thinking that causing a pg_temp acting set by forcing backfill may
>> be the right way to mitigate the issue. [1]
>>
>> I believe that reducing the PG log entries for these OSDs would
>> accomplish that, but I am also thinking that tuning
>> osd_async_recovery_min_cost [2] may accomplish something similar. I am
>> not sure of the appropriate value for that option at this point, or
>> whether there is a better approach, so I am seeking any input here.
>>
>> Further, if this issue sounds familiar, or sounds like another
>> condition within the OSD may be at play, I would be interested in your
>> input or thoughts. Thanks!
>>
>> [1] https://docs.ceph.com/en/latest/dev/peering/#concepts
>> [2]
>> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost
>>
>> Respectfully,
>>
>> Wes Dillingham
>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>> wes@xxxxxxxxxxxxxxxxx
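
PS: For anyone following along, here is a rough sketch of the runtime tuning
being discussed. osd.12 is just a placeholder id, and the value of 0 reflects
Joshua's report rather than a general recommendation:

    # Check the current value
    ceph config get osd osd_async_recovery_min_cost

    # Lower it cluster-wide at runtime (the option carries the "runtime"
    # flag, so no OSD restart is needed); 0 means async recovery is chosen
    # whenever it is applicable, regardless of how much is missing
    ceph config set osd osd_async_recovery_min_cost 0

    # Wes's workaround: drop primary affinity before restarting the OSD,
    # then restore it once recovery has finished
    ceph osd primary-affinity osd.12 0
    ceph osd primary-affinity osd.12 1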