Hello,

We are experiencing slowdowns on one of our radosgw clusters. We restart the radosgw daemons every 2 hours, and things start getting slow after about an hour and a half: the average GET/PUT latencies go from 20ms/400ms to 1s/5s+ according to our metrics.

When I stop traffic to one of the radosgw daemons by setting it to DRAIN in HAProxy, and HAProxy reports 0 sessions to it, I still see the objecter_requests op count going up and down 15 minutes later. linger_ops stays at a constant 10 the whole time.

> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 46
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 211
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 427
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 26
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 190

The requests say they are "call rgw.bucket_list in=234b" and mostly reference ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.N", where N is 1-149 (the shard count for that bucket).

> {
>     "tid": 2583119,
>     "pg": "8.579482b0",
>     "osd": 208,
>     "object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
>     "object_locator": "@8",
>     "target_object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
>     "target_object_locator": "@8",
>     "paused": 0,
>     "used_replica": 0,
>     "precalc_pgid": 0,
>     "last_sent": "1796975.496664s",
>     "age": 0.350012616,
>     "attempts": 1,
>     "snapid": "head",
>     "snap_context": "0=[]",
>     "mtime": "1970-01-01T00:00:00.000000+0000",
>     "osd_ops": [
>         "call rgw.bucket_list in=234b"
>     ]
> },

I don't think these ops are coming from RGW's background threads, because this daemon's ceph.conf disables them:

> rgw enable gc threads = false
> rgw enable lc threads = false
> rgw dynamic resharding = false

When I restart the service while it is still DRAINed in HAProxy, objecter_requests stays at 0 even a few minutes after it has come back up:

> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0

Any thoughts on why these ops appear to be stuck/recurring until the daemon is restarted? I think this is related to our performance issues, but I don't know what the fix is. The rgws are 18.2.4, running as Podman containers on Debian 11. Our other clusters do not exhibit this behavior.

Thanks!
Reid
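P.S. For reference, a quick way to see which bucket index shard objects the in-flight ops are hitting, if anyone wants to check the same thing on their side. This is just a sketch against the same admin socket and daemon name as in the commands above; the jq expression simply groups the current ops by object_id:

> # sketch: count in-flight objecter ops per target object (bucket index shard)
> ceph daemon client.rgw.rgw1 objecter_requests \
>     | jq -r '[.ops[].object_id] | group_by(.) | map("\(length)\t\(.[0])") | .[]'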