Hey Reid,

This sounds similar to what we saw in
https://tracker.ceph.com/issues/62256, in case that helps with your
investigation.

Josh

On Mon, Jan 27, 2025 at 8:07 AM Reid Guyett <reid.guyett@xxxxxxxxx> wrote:
>
> Hello,
>
> We are experiencing slowdowns on one of our radosgw clusters. We restart
> the radosgw daemons every 2 hours, and things start getting slow after an
> hour and a half. The average get/put latencies go from 20ms/400ms to
> 1s/5s+ according to the metrics. When I stop traffic to one of the radosgw
> daemons by setting it to DRAIN in HAProxy, and HAProxy reports 0 sessions
> to the daemon, I still see objecter_requests going up and down 15 minutes
> later. linger_ops seems to stay at a constant 10 the whole time.
>
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 46
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 211
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 427
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 26
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 190
>
> The requests say they are "call rgw.bucket_list in=234b" and mostly
> reference ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.N",
> where N is 1-149 (the shard count for the bucket).
>
> > {
> >     "tid": 2583119,
> >     "pg": "8.579482b0",
> >     "osd": 208,
> >     "object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> >     "object_locator": "@8",
> >     "target_object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> >     "target_object_locator": "@8",
> >     "paused": 0,
> >     "used_replica": 0,
> >     "precalc_pgid": 0,
> >     "last_sent": "1796975.496664s",
> >     "age": 0.350012616,
> >     "attempts": 1,
> >     "snapid": "head",
> >     "snap_context": "0=[]",
> >     "mtime": "1970-01-01T00:00:00.000000+0000",
> >     "osd_ops": [
> >         "call rgw.bucket_list in=234b"
> >     ]
> > },
>
> I don't think this should be other rgw processes, because this daemon's
> ceph.conf is set to disable the background threads:
>
> > rgw enable gc threads = false
> > rgw enable lc threads = false
> > rgw dynamic resharding = false
>
> When I restart the service while it is still in a DRAINED state in
> HAProxy, checking objecter_requests yields 0 even a few minutes after it
> has come back up.
>
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
>
> Any thoughts on why these ops appear to be stuck/recurring until the
> daemon is restarted? I think this is related to our performance issues,
> but I don't know what the fix is.
>
> The rgws are 18.2.4 running as containers in Podman on Debian 11. Our
> other clusters do not exhibit this behavior.
>
> Thanks!
> Reid
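One quick way to narrow it down while you compare against that tracker
issue: summarize the in-flight ops by op type and by target object. This is
just a rough sketch reusing the client.rgw.rgw1 admin socket from your
output; the jq/sort/uniq pipeline is an ad-hoc suggestion, not anything
RGW-specific.

# tally in-flight ops by the OSD op they carry
ceph daemon client.rgw.rgw1 objecter_requests \
    | jq -r '.ops[].osd_ops[]' | sort | uniq -c | sort -rn

# tally in-flight ops by target object (i.e. which bucket index shards)
ceph daemon client.rgw.rgw1 objecter_requests \
    | jq -r '.ops[].object_id' | sort | uniq -c | sort -rn | head -20

Sampling that a few times while the daemon is drained should show whether
the bucket_list calls are spread across all 149 shards or concentrated on a
handful of them.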
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx