RGW requests piling up

Gauvain Pocentek <gauvainpocentek@xxxxxxxxx> · Thu, 21 Dec 2023 13:40:09 +0100

Hello Ceph users,

We've been having an issue with RGW for a couple of days and we would
appreciate some help, ideas, or guidance to track down the cause.

We run a multi-site setup which has been working fine so far. We don't
actually have data replication enabled yet, only metadata replication. On
the master region we've started to see requests piling up in the rgw
process, leading to very slow operations and failures all over the place
(clients time out before getting responses from rgw). The workaround for
now is to restart the rgw containers regularly.

We also made a mistake and forcefully deleted a bucket on a secondary zone.
This might be the trigger, but we are not sure.

Other symptoms include:

* Increased memory usage of the RGW processes (we bumped the container
limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more compared to what
we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
active requests) seem to show that the number of concurrent requests
increases with time, although we don't see more requests coming in on the
load-balancer side.
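
To make that last symptom concrete, here is a minimal sketch of how the two
trends could be compared, assuming the metric samples have been exported as
(unix_seconds, value) pairs. The function name and the sample values below
are made up for illustration; they are not our real measurements.

```python
def slope_per_hour(samples):
    """Least-squares slope of (unix_seconds, value) samples, in units per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Covariance of time and value, divided by the variance of time.
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den * 3600.0


# Hypothetical samples taken every 15 minutes after an rgw restart.
qactive = [(0, 10), (900, 35), (1800, 62), (2700, 85), (3600, 110)]
lb_requests = [(0, 200), (900, 205), (1800, 198), (2700, 202), (3600, 200)]

qactive_trend = slope_per_hour(qactive)
lb_trend = slope_per_hour(lb_requests)

# A steadily rising qactive with flat incoming traffic suggests requests
# are accumulating inside rgw rather than arriving faster.
print(f"qactive trend: {qactive_trend:+.1f} per hour")
print(f"load-balancer trend: {lb_trend:+.1f} per hour")
```

With real data, a clearly positive slope for ceph_rgw_qactive next to a
roughly flat slope on the load-balancer side would match what we are seeing.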

The current thought is that the RGW process doesn't close the requests
properly, or that some requests just hang. After a restart of the process
things look OK but the situation turns bad fairly quickly (after 1 hour we
start to see many timeouts).

The rados cluster itself seems completely healthy. It is also used for rbd
volumes, and we haven't seen any degradation there.

Has anyone experienced that kind of issue? Anything we should be looking at?

Thanks for your help!

Gauvain
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx