Re: RGW seems to not clean up after some requests

Abhishek Lekshmanan <abhishek@xxxxxxxx> · Mon, 02 Nov 2020 14:54:07 +0100

Denis Krienbühl <denis@xxxxxxx> writes:

> Hi everyone
>
> We have faced some RGW outages recently, with the RGW returning HTTP 503. At first only a few requests are affected, then most, then all, over the course of 1-2 hours. This seems to have started after we updated from 15.2.4 to 15.2.5.
>
> The line that accompanies these outages in the log is the following:
>
> 	s3:list_bucket Scheduling request failed with -2218

There isn't much in terms of code changes in the scheduler from
v15.2.4 to v15.2.5. Does the perf dump on the RGW admin socket
(`ceph daemon <client.rgw-name> perf dump`) show any throttle counts?
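
A minimal sketch of how that check could be scripted, assuming the perf dump output is available as JSON. It only searches for section or counter names containing "throttle"; the names in the sample are made up for illustration and are not taken from a real RGW.

```python
import json


def find_counters(perf_dump, needle="throttle"):
    """Return {'section.counter': value} for names containing `needle`."""
    hits = {}
    for section, counters in perf_dump.items():
        if not isinstance(counters, dict):
            continue  # skip anything that is not a section of counters
        for name, value in counters.items():
            if needle in section.lower() or needle in name.lower():
                hits[f"{section}.{name}"] = value
    return hits


# Hypothetical sample; real section and counter names will differ.
SAMPLE = json.loads("""
{
  "example-scheduler": {"qlen": 3, "throttle": 17},
  "example-other": {"req": 1200}
}
""")

if __name__ == "__main__":
    for key, value in sorted(find_counters(SAMPLE).items()):
        print(key, value)
```

Running this against two dumps taken a few minutes apart would show whether a throttle-related counter keeps growing.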

>
> It first pops up a few times here and there, until it eventually applies to all requests. It seems to indicate that the throttler has reached the limit of open connections.
>
> As we run a pair of HAProxy instances in front of RGW, which limit the number of connections to the two RGW instances to 400, this limit should never be reached. We do use RGW metadata sync between the instances, which could account for some extra connections, but if I look at open TCP connections between the instances I can count no more than 20 at any given time.
>
> I also noticed that some connections in the RGW log seem to never complete. That is, I can find a ‘starting new request’ line, but no associated ‘req done’ or ‘beast’ line.
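
One way to count such requests is to pair up the start and completion lines by request id. The sketch below assumes that both lines carry a `req=0x...` token, as in the hypothetical sample; the exact log format may differ between versions and log levels, so the patterns are an assumption to adjust.

```python
import re

# Assumed line shapes; adjust the patterns to the actual log format.
START = re.compile(r"starting new request req=(0x[0-9a-f]+)")
DONE = re.compile(r"req done req=(0x[0-9a-f]+)")


def unfinished_requests(lines):
    """Return [(req_id, start_line_number)] for starts with no matching done."""
    open_reqs = {}  # req id -> line number of its start line
    leaked = []     # starts that never saw a done line
    for lineno, line in enumerate(lines, 1):
        m = START.search(line)
        if m:
            req = m.group(1)
            if req in open_reqs:  # id reused before a done line appeared
                leaked.append((req, open_reqs[req]))
            open_reqs[req] = lineno
            continue
        m = DONE.search(line)
        if m:
            open_reqs.pop(m.group(1), None)
    leaked.extend(open_reqs.items())
    return sorted(leaked, key=lambda item: item[1])


# Hypothetical log excerpt, not copied from a real RGW.
SAMPLE = [
    "====== starting new request req=0x7f01 =====",
    "====== req done req=0x7f01 op status=0 http_status=200 ======",
    "====== starting new request req=0x7f02 =====",
    "====== starting new request req=0x7f03 =====",
    "====== req done req=0x7f03 op status=0 http_status=200 ======",
]

if __name__ == "__main__":
    for req, lineno in unfinished_requests(SAMPLE):
        print(f"{req} started on line {lineno} and never completed")
```

Requests still in flight at the end of the log are reported as well, so the last few entries need a manual check.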
>
> I don’t think there are any hung connections around, as they are killed by HAProxy after a short timeout.
>
> Looking at the code, it seems as if the throttler in use (SimpleThrottler) eventually reaches the maximum count of 1024 connections (outstanding_requests) and never recovers. I believe that the request_complete function is not called in all cases, but I am not familiar with the Ceph codebase, so I am not sure.
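
The suspected failure mode can be illustrated with a toy model. This is not the Ceph implementation: `CountingThrottle`, `schedule_request` and `simulate` are hypothetical names, and only `outstanding_requests`, `request_complete` and the limit of 1024 come from the description above. It shows that a counter which is incremented on admission but not always decremented will eventually reject everything, even if only a small fraction of requests skip the completion call.

```python
class CountingThrottle:
    """Toy stand-in for a counter-based throttle; not the RGW code."""

    def __init__(self, max_requests=1024):
        self.max_requests = max_requests
        self.outstanding_requests = 0

    def schedule_request(self):
        """Admit a request if the limit is not reached, else reject it."""
        if self.outstanding_requests >= self.max_requests:
            return False  # a server would answer with an error here
        self.outstanding_requests += 1
        return True

    def request_complete(self):
        """Release the slot taken by an admitted request."""
        self.outstanding_requests -= 1


def simulate(total, leak_every, max_requests=1024):
    """Run `total` sequential requests and return how many were rejected.

    Every `leak_every`-th request is admitted but never completed.
    """
    throttle = CountingThrottle(max_requests)
    rejected = 0
    for i in range(1, total + 1):
        if not throttle.schedule_request():
            rejected += 1
            continue
        if i % leak_every != 0:
            throttle.request_complete()
    return rejected


if __name__ == "__main__":
    # With a limit of 5 and one leak per 10 requests, the 50th request
    # takes the last slot and requests 51..100 are all rejected.
    print(simulate(total=100, leak_every=10, max_requests=5))
```

In this model the only way to recover is to reset the counter, which matches the observation that a restart clears the problem.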
>
> See https://github.com/ceph/ceph/blob/cc17681b478594aa39dd80437256a54e388432f0/src/rgw/rgw_dmclock_async_scheduler.h#L166-L214
>
> Does anyone see the same phenomenon? Could this be a bug in the request handling of RGW, or am I wrong in my assumptions?
>
> For now we’re just restarting our RGWs regularly, which seems to keep the problem at bay.
>
> Thanks for any hints.
>
> Denis
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
Abhishek 