radosgw daemons with "stuck ops"

Hello,

We are experiencing slowdowns on one of our radosgw clusters. We restart
the radosgw daemons every 2 hours, and things start getting slow after about
an hour and a half: average GET/PUT latencies go from 20ms/400ms to 1s/5s+
according to our metrics. When I stop traffic to one of the radosgw daemons
by setting it to DRAIN in HAProxy, and HAProxy reports 0 sessions to that
daemon, I still see objecter_requests going up and down 15 minutes later.
linger_ops stays at a constant 10 the whole time.

> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 46
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 211
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 427
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 26
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 190
>
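
For reference, those samples were taken by hand a few seconds apart. A loop
along these lines can poll the count on a fixed interval while the daemon
stays drained (nothing rgw-specific here; the 10s interval is arbitrary):

while true; do
    printf '%s ' "$(date +%T)"    # timestamp each sample
    ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
    sleep 10
done
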
The requests are "call rgw.bucket_list in=234b" and mostly reference
".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.N", where N is 1-149
(the shard count for the bucket).

>   {
>     "tid": 2583119,
>     "pg": "8.579482b0",
>     "osd": 208,
>     "object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
>     "object_locator": "@8",
>     "target_object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
>     "target_object_locator": "@8",
>     "paused": 0,
>     "used_replica": 0,
>     "precalc_pgid": 0,
>     "last_sent": "1796975.496664s",
>     "age": 0.350012616,
>     "attempts": 1,
>     "snapid": "head",
>     "snap_context": "0=[]",
>     "mtime": "1970-01-01T00:00:00.000000+0000",
>     "osd_ops": [
>       "call rgw.bucket_list in=234b"
>     ]
>   },
>
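
In case it's useful, a couple of jq queries along these lines make it easy to
tally the in-flight ops; they only use the admin socket output shown above,
nothing else is assumed:

# count in-flight ops by operation type
ceph daemon client.rgw.rgw1 objecter_requests \
  | jq '[.ops[].osd_ops[0]] | group_by(.) | map({op: .[0], count: length})'

# count in-flight ops per bucket index shard object, busiest first
ceph daemon client.rgw.rgw1 objecter_requests \
  | jq '[.ops[].object_id] | group_by(.) | map({shard: .[0], count: length}) | sort_by(-.count)'
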
I don't think these are coming from rgw's background threads, because we have
this daemon's ceph.conf set to disable them:

> rgw enable gc threads = false
> rgw enable lc threads = false
> rgw dynamic resharding = false
>
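
For what it's worth, these can also be verified against the running daemon
over the admin socket, in case the container picked up a different conf than
expected (the option names are just the underscore forms of the settings
above):

ceph daemon client.rgw.rgw1 config get rgw_enable_gc_threads
ceph daemon client.rgw.rgw1 config get rgw_enable_lc_threads
ceph daemon client.rgw.rgw1 config get rgw_dynamic_resharding
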

When I restart the service while it is still DRAINed in HAProxy,
objecter_requests stays at 0, even a few minutes after the daemon comes back
up.

> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
> [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> 0
>
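
To rule out anything bypassing HAProxy and talking to the drained daemon
directly, something like the following can be run on the rgw host; the
HAProxy admin socket path and the rgw frontend port 8080 are guesses for our
setup and would need adjusting:

# confirm HAProxy really has the server drained with no sessions
echo "show servers state" | socat stdio /var/run/haproxy/admin.sock | grep rgw1

# confirm nothing is connected to the rgw frontend port directly
ss -tn state established '( sport = :8080 )'
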

Any thoughts on why these ops keep appearing/recurring until the daemon is
restarted? I suspect this is related to our performance issues, but I don't
know what the fix is.
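
If it would help, I can also grab what the bucket_list calls look like from
the OSD side the next time it happens, e.g. on the host that owns osd.208
(the OSD from the request above):

ceph daemon osd.208 dump_ops_in_flight    # ops currently being processed
ceph daemon osd.208 dump_historic_ops     # recently completed slow ops
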

The rgws are 18.2.4 running as containers in Podman on Debian 11. Our other
clusters do not exhibit this behavior.

Thanks!

Reid
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


