Hey Reid,

This sounds similar to what we saw in
https://tracker.ceph.com/issues/62256, in case that helps with your
investigation.

Josh

On Mon, Jan 27, 2025 at 8:07 AM Reid Guyett <reid.guyett@xxxxxxxxx> wrote:
>
> Hello,
>
> We are experiencing slowdowns on one of our radosgw clusters. We restart
> the radosgw daemons every 2 hours, and things start getting slow after an
> hour and a half. The average get/put latencies go from 20ms/400ms to
> 1s/5s+ according to the metrics. When I stop traffic to one of the radosgw
> daemons by setting it to DRAIN in HAProxy, and HAProxy reports 0 sessions
> to the daemon, I still see objecter_requests going up and down 15 minutes
> later. linger_ops seems to stay at a constant 10 the whole time.
>
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 46
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 211
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 427
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 26
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 190
>
> The requests say they are "call rgw.bucket_list in=234b" and mostly
> reference ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.N",
> where N is 1-149 (the shard count for the bucket).
>
> > {
> >     "tid": 2583119,
> >     "pg": "8.579482b0",
> >     "osd": 208,
> >     "object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> >     "object_locator": "@8",
> >     "target_object_id": ".dir.5e9bc383-f7bd-4fd1-b607-1e563bfe0011.886814949.1.124",
> >     "target_object_locator": "@8",
> >     "paused": 0,
> >     "used_replica": 0,
> >     "precalc_pgid": 0,
> >     "last_sent": "1796975.496664s",
> >     "age": 0.350012616,
> >     "attempts": 1,
> >     "snapid": "head",
> >     "snap_context": "0=[]",
> >     "mtime": "1970-01-01T00:00:00.000000+0000",
> >     "osd_ops": [
> >         "call rgw.bucket_list in=234b"
> >     ]
> > },
>
> I don't think this should be other rgw processes, because this daemon's
> ceph.conf is set to disable the background threads:
>
> > rgw enable gc threads = false
> > rgw enable lc threads = false
> > rgw dynamic resharding = false
>
> When I restart the service while it is still in a DRAINED state in
> HAProxy, checking objecter_requests yields 0 even a few minutes after it
> has come back up.
>
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
> > [root@rgw1 ~]# ceph daemon client.rgw.rgw1 objecter_requests | jq '.ops | length'
> > 0
>
> Any thoughts on why these ops appear to be stuck/recurring until the
> daemon is restarted? I think this is related to our performance issues,
> but I don't know what the fix is.
>
> The rgws are 18.2.4 running as containers in Podman on Debian 11. Our
> other clusters do not exhibit this behavior.
>
> Thanks!
> Reid
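One quick way to narrow it down while you compare against that tracker
issue: summarize the in-flight ops by op type and by target object. This is
just a rough sketch reusing the client.rgw.rgw1 admin socket from your
output; the jq/sort/uniq pipeline is an ad-hoc suggestion, not anything
RGW-specific.

# tally in-flight ops by the OSD op they carry
ceph daemon client.rgw.rgw1 objecter_requests \
    | jq -r '.ops[].osd_ops[]' | sort | uniq -c | sort -rn

# tally in-flight ops by target object (i.e. which bucket index shards)
ceph daemon client.rgw.rgw1 objecter_requests \
    | jq -r '.ops[].object_id' | sort | uniq -c | sort -rn | head -20

Sampling that a few times while the daemon is drained should show whether
the bucket_list calls are spread across all 149 shards or concentrated on a
handful of them.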
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx