I am fighting an issue on an 18.2.0 cluster where a restart of an OSD that backs the RGW index pool causes crippling slow ops. If the OSD is marked with a primary-affinity of 0 prior to the restart, no slow ops are observed; if it has a primary-affinity of 1, slow ops occur. The slow ops only occur during the recovery period of the OMAP data, and further only occur when client activity is allowed to reach the cluster. Luckily I am able to test this during periods when I can disable all client activity at the upstream proxy.

Given that the primary-affinity change prevents the slow ops, I think this may be a case of recovery being more detrimental than backfill. I am thinking that forcing backfill, so that a pg_temp acting set is installed [1], may be the right way to mitigate the issue. I believe that reducing the PG log entries for these OSDs would accomplish that, but tuning osd_async_recovery_min_cost [2] may accomplish something similar. I am not sure of the appropriate value for that config at this point, or whether there is a better approach, and am seeking any input here.

Further, if this issue sounds familiar, or sounds like some other condition within the OSD may be at hand, I would be interested in hearing your input or thoughts. Thanks!

[1] https://docs.ceph.com/en/latest/dev/peering/#concepts
[2] https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx
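
P.S. For concreteness, the restart procedure that avoids the slow ops looks roughly like the sketch below. osd.12 is just a placeholder for whichever index-pool OSD is being restarted, and the systemd unit name will vary by deployment:

    # Drop primary affinity so no index-pool PG keeps this OSD as its
    # primary while the OMAP data recovers.
    ceph osd primary-affinity osd.12 0

    # Restart the daemon (invocation depends on the deployment).
    systemctl restart ceph-osd@12

    # Once recovery completes, restore the affinity.
    ceph osd primary-affinity osd.12 1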
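
The tuning I am weighing would be along these lines; the values are guesses to illustrate the idea, not tested recommendations:

    # Shrink the PG log on the index-pool OSDs so a restarted OSD falls
    # behind its peers' logs and must backfill (installing a pg_temp
    # acting set) instead of doing log-based recovery.
    ceph config set osd.12 osd_min_pg_log_entries 500
    ceph config set osd.12 osd_max_pg_log_entries 500

    # Or lower the async recovery threshold so a recovering OSD is moved
    # out of the acting set sooner (the default cost is 100).
    ceph config set osd osd_async_recovery_min_cost 10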