Hello Ceph users,

We've been having an issue with RGW for a couple of days and would appreciate some help, ideas, or guidance to figure it out.

We run a multi-site setup which has been working fine so far. We don't actually have data replication enabled yet, only metadata replication. On the master region we've started to see requests piling up in the rgw process, leading to very slow operations and failures all over the place (clients time out before getting responses from rgw). The workaround for now is to restart the rgw containers regularly.

We made a mistake and forcefully deleted a bucket on a secondary zone. This might be the trigger, but we are not sure.

Other symptoms include:

* Increased memory usage of the RGW processes (we bumped the container limits from 4G to 48G to cater for that)
* Lots of read IOPS on the index pool (4 or 5 times more than what we were seeing before)
* The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of active requests) seem to show that the number of concurrent requests increases with time, although we don't see more requests coming in on the load-balancer side.

The current thought is that the RGW process doesn't close the requests properly, or that some requests just hang. After a restart of the process things look OK, but the situation turns bad fairly quickly (after 1 hour we start to see many timeouts).

The rados cluster seems completely healthy. It is also used for rbd volumes, and we haven't seen any degradation there.

Has anyone experienced that kind of issue? Anything we should be looking at?

Thanks for your help!

Gauvain
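
One rough way to confirm that the active-request count described above really grows over time, rather than just being noisy, is to fit a line through samples of ceph_rgw_qactive and look at the slope. The following is a minimal Python sketch under stated assumptions: the samples are assumed to have already been exported from Prometheus as (seconds, value) pairs, and the function names, the threshold, and the sample values are all hypothetical and only there for illustration.

```python
from statistics import mean


def queue_growth_rate(samples):
    """Return the least-squares slope of (seconds, value) samples.

    The result is in "requests per second": a positive value means the
    number of active requests is trending upwards over the window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    t_mean, v_mean = mean(times), mean(values)
    denom = sum((t - t_mean) ** 2 for t in times)
    if denom == 0:
        raise ValueError("all samples share the same timestamp")
    numer = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    return numer / denom


def looks_like_leak(samples, min_rate_per_hour=10.0):
    """Flag a window whose trend exceeds a (hypothetical) hourly threshold."""
    return queue_growth_rate(samples) * 3600 >= min_rate_per_hour


# Made-up samples taken every 15 minutes after a restart (not real data).
growing = [(0, 5), (900, 20), (1800, 35), (2700, 50), (3600, 65)]
flat = [(0, 5), (900, 5), (1800, 5)]

if __name__ == "__main__":
    print("growing window flagged:", looks_like_leak(growing))
    print("flat window flagged:", looks_like_leak(flat))
```

If the slope stays clearly positive while the request rate seen on the load balancer is flat, that would be consistent with requests hanging inside the process rather than with extra incoming traffic.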