Re: RGW requests piling up

Gauvain Pocentek <gauvainpocentek@xxxxxxxxx> · Fri, 22 Dec 2023 14:08:47 +0100

Hi again,

It turns out that our RADOS cluster wasn't as healthy as we thought: the RGW
index pool couldn't handle the load. Increasing its PG count from 256 to 512
helped, and RGW is back to normal behaviour.
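
For anyone hitting the same thing, here is a minimal sketch of that kind of
PG increase. It only builds the `ceph osd pool set <pool> pg_num <n>` command
line and does not run it. The pool name `default.rgw.buckets.index` is an
assumption (the real name depends on the zone), and the helper names are made
up for illustration.

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def pg_num_command(pool: str, pg_num: int) -> list[str]:
    """Build the argv for `ceph osd pool set <pool> pg_num <pg_num>`.

    Ceph recommends a power-of-two PG count; rejecting anything else is a
    choice made by this sketch, not something the CLI itself enforces.
    """
    if not is_power_of_two(pg_num):
        raise ValueError(f"pg_num should be a power of two, got {pg_num}")
    return ["ceph", "osd", "pool", "set", pool, "pg_num", str(pg_num)]


if __name__ == "__main__":
    # Assumed pool name; check `ceph osd pool ls` for the real one.
    print(" ".join(pg_num_command("default.rgw.buckets.index", 512)))
```

The command list can be passed to `subprocess.run` on a node with admin
credentials once the pool name has been checked.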

There is still an unusually high number of read IOPS on the index pool, and
we'll try to figure out what's causing it.

Gauvain

On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <gauvainpocentek@xxxxxxxxx>
wrote:

> Hello Ceph users,
>
> We've been having an issue with RGW for a couple of days and we would
> appreciate some help, ideas, or guidance to figure out the cause.
>
> We run a multi-site setup which has been working fine so far. We
> don't actually have data replication enabled yet, only metadata
> replication. On the master region we've started to see requests piling up
> in the rgw process, leading to very slow operations and failures all over
> the place (clients time out before getting responses from rgw). The
> workaround for now is to restart the rgw containers regularly.
>
> We made a mistake and forcefully deleted a bucket on a secondary zone;
> this might be the trigger, but we are not sure.
>
> Other symptoms include:
>
> * Increased memory usage of the RGW processes (we bumped the container
> limits from 4G to 48G to cater for that)
> * Lots of read IOPS on the index pool (4 to 5 times what we were seeing
> before)
> * The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> active requests) seem to show that the number of concurrent requests
> increases with time, although we don't see more requests coming in on the
> load-balancer side.
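
One way to check that last symptom is to compare the trend of the
active-request gauge with the trend of incoming requests over the same
window. This is a minimal sketch, assuming the samples have already been
exported from Prometheus as (timestamp, value) pairs. The function names and
both thresholds are made up for illustration and would need tuning.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def slope(samples: Sequence[Sample]) -> float:
    """Least-squares slope of value over time, in units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def looks_like_stuck_requests(
    qactive: Sequence[Sample],
    incoming_rate: Sequence[Sample],
    growth_per_hour: float = 10.0,        # assumed threshold
    flat_fraction_per_hour: float = 0.1,  # assumed threshold
) -> bool:
    """True if active requests keep growing while incoming load stays flat."""
    q_growth = slope(qactive) * 3600
    rate_growth = slope(incoming_rate) * 3600
    mean_rate = sum(v for _, v in incoming_rate) / len(incoming_rate)
    # "Flat" here means the hourly change is small relative to the mean rate.
    rate_is_flat = abs(rate_growth) <= flat_fraction_per_hour * mean_rate
    return q_growth >= growth_per_hour and rate_is_flat
```

Feeding it the `ceph_rgw_qactive` samples and the load-balancer request rate
for the period after a restart would show whether the growth is independent
of the incoming load.
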
>
> The current thought is that the RGW process doesn't close the requests
> properly, or that some requests just hang. After a restart of the process
> things look OK but the situation turns bad fairly quickly (after 1 hour we
> start to see many timeouts).
>
> The rados cluster seems completely healthy; it is also used for rbd
> volumes, and we haven't seen any degradation there.
>
> Has anyone experienced that kind of issue? Anything we should be looking
> at?
>
> Thanks for your help!
>
> Gauvain
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx