To conclude this story, we finally discovered that one of our users was using a Prometheus exporter (s3_exporter) that constantly listed the content of their buckets containing millions of objects. That really didn't play well with Ceph. Two of these exporters were generating ~700k read IOPS on the index pool, and managed to kill the RGWs (14 of them) after a few hours.

I hope this can help someone in the future.

Gauvain

On Fri, Dec 22, 2023 at 3:09 PM Gauvain Pocentek <gauvainpocentek@xxxxxxxxx> wrote:

> I'd like to say that it was something smart but it was a bit of luck.
>
> I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the
> same hosts) to deal with another issue, and while checking the system I
> noticed that one of the OSDs was using a lot more CPU than the others. It
> made me think that the increased IOPS could put a strain on some of the
> OSDs without impacting the whole cluster, so I decided to increase pg_num
> to spread the operations to more OSDs, and it did the trick. The qlen
> metric went back to something similar to what we had before the problems
> started.
>
> We're going to look into adding CPU/RAM monitoring for all the OSDs next.
>
> Gauvain
>
> On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver <drew.weaver@xxxxxxxxxx>
> wrote:
>
>> Can you say how you determined that this was a problem?
>>
>> -----Original Message-----
>> From: Gauvain Pocentek <gauvainpocentek@xxxxxxxxx>
>> Sent: Friday, December 22, 2023 8:09 AM
>> To: ceph-users@xxxxxxx
>> Subject: Re: RGW requests piling up
>>
>> Hi again,
>>
>> It turns out that our rados cluster wasn't that happy: the RGW index pool
>> wasn't able to handle the load. Scaling the PG number helped (256 to 512),
>> and the RGW is back to normal behaviour.
>>
>> There is still a huge number of read IOPS on the index, and we'll try to
>> figure out what's happening there.
>>
>> Gauvain
>>
>> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek
>> <gauvainpocentek@xxxxxxxxx> wrote:
>>
>> > Hello Ceph users,
>> >
>> > We've been having an issue with RGW for a couple of days and we would
>> > appreciate some help, ideas, or guidance to figure out the issue.
>> >
>> > We run a multi-site setup which has been working pretty well so far.
>> > We don't actually have data replication enabled yet, only metadata
>> > replication. On the master region we've started to see requests piling
>> > up in the rgw process, leading to very slow operations and failures
>> > all over the place (clients time out before getting responses from
>> > rgw). The workaround for now is to restart the rgw containers regularly.
>> >
>> > We've made a mistake and forcefully deleted a bucket on a secondary
>> > zone; this might be the trigger, but we are not sure.
>> >
>> > Other symptoms include:
>> >
>> > * Increased memory usage of the RGW processes (we bumped the container
>> > limits from 4G to 48G to cater for that)
>> > * Lots of read IOPS on the index pool (4 or 5 times more than what we
>> > were seeing before)
>> > * The Prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
>> > active requests) seem to show that the number of concurrent requests
>> > increases with time, although we don't see more requests coming in on
>> > the load-balancer side.
>> >
>> > The current thought is that the RGW process doesn't close the requests
>> > properly, or that some requests just hang. After a restart of the
>> > process things look OK, but the situation turns bad fairly quickly
>> > (after 1 hour we start to see many timeouts).
>> >
>> > The rados cluster seems completely healthy; it is also used for rbd
>> > volumes, and we haven't seen any degradation there.
>> >
>> > Has anyone experienced that kind of issue? Anything we should be
>> > looking at?
>> >
>> > Thanks for your help!
>> >
>> > Gauvain
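
For anyone landing on this thread later, here is a rough sketch of the pg_num bump and metric check described above. The pool name is an assumption (the common default for the RGW index pool); the thread doesn't name the actual pool, so check your own cluster first:

    # find the index pool name on your cluster (assumed below to be default.rgw.buckets.index)
    ceph osd pool ls detail | grep index

    # bump placement groups on the index pool (256 -> 512 in this thread);
    # on recent Ceph releases pgp_num follows pg_num automatically
    ceph osd pool set default.rgw.buckets.index pg_num 512
    ceph osd pool set default.rgw.buckets.index pgp_num 512

The queue metrics mentioned in the thread can be graphed with a simple Prometheus query, for example:

    # per-RGW count of in-flight requests; a steady climb matches the symptom described here
    sum by (instance) (ceph_rgw_qactive)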