Re: RX throttling causes keep-alive timeout

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Wed, 4 Aug 2021 17:56:31 +0100

On Wed, Aug 04, 2021 at 04:16:10PM +0000, Ivan Teterevkov wrote:
> Hello folks,
> 
> I recently discovered that the max_client_requests configuration option
> affects the keep-alive RPC and can cause the connection timeout, and I
> wanted to verify if my understanding is correct.
> 
> Let me outline the context. Under certain circumstances, one of the
> connected clients in my setup issued multiple concurrent long-standing
> live-migration requests and reached the default limit of five concurrent
> requests. Consequently, it triggered the RX throttling, so the server
> stopped reading the incoming data from this client's file descriptor.
> Meanwhile, the server issued the keepalive "ping" requests but ignored
> the "pong" responses from the client due to the RX throttling. As a result,
> the server concluded the client was dead and closed the connection with the
> "connection closed due to keepalive timeout" message after the default five
> "ping" attempts with five seconds timeout each.
> 
> The idea of throttling makes perfect sense: the server prevents hogging
> of the worker thread pool (or would prevent the unbounded growth of the
> memory footprint if the libvirtd server continued parsing the incoming
> data and queued the requests). What concerns me is that the server drops
> the connection for the alive clients when they're throttled.

Note the limiting happens before the parsing - we don't even read
the data off the socket when we are rate limited, as we don't want
our in-memory pending queue to grow without bound.

> One approach to this problem is implementing the EAGAIN-like handling:
> parse the incoming data above the limit and respond with the error response,
> but handle the keep-alive RPCs gracefully. However, I see two problems here:
> either it's a backwards-compatibility concern if implemented unconditionally
> or polluting the configuration space if implemented conditionally.

There is no way to parse the keepalives without also pulling all
the preceding data off the socket, which defeats the purpose of
having the limit.
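
To illustrate the ordering problem, here is a toy model in Python. It is
only a sketch and not libvirt code: the ThrottledClient class, its method
names and the message strings are all made up for illustration. The one
assumption it encodes is that unread data on a connection is strictly
FIFO and that the server stops reading once the request limit is hit.

```python
from collections import deque


class ThrottledClient:
    """Toy model of one client connection as seen by the server.

    Hypothetical illustration only; none of these names exist in libvirt.
    """

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.socket_buffer = deque()  # unread messages, strictly FIFO
        self.in_flight = 0            # requests currently being processed
        self.pongs_seen = 0           # keepalive responses actually read

    def client_sends(self, message):
        # The client can always write; the data just sits unread.
        self.socket_buffer.append(message)

    def server_poll(self):
        # Read in order until the buffer is empty or the limit is reached.
        while self.socket_buffer and self.in_flight < self.max_requests:
            message = self.socket_buffer.popleft()
            if message == "pong":
                self.pongs_seen += 1
            else:
                self.in_flight += 1

    def request_finished(self):
        self.in_flight -= 1


conn = ThrottledClient(max_requests=5)
for i in range(6):
    conn.client_sends(f"migrate-{i}")
conn.client_sends("pong")

conn.server_poll()
print(conn.in_flight, conn.pongs_seen, len(conn.socket_buffer))  # 5 0 2

conn.request_finished()
conn.server_poll()
print(conn.in_flight, conn.pongs_seen, len(conn.socket_buffer))  # 5 0 1
```

Even after one request completes, the next thing read is the sixth
request queued ahead of the "pong", so the keepalive response stays
unread until enough earlier work has drained.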

> What is the community's opinion on the above issue?

IMHO this is a tuning problem for your application. If you expect
to have 5 long-running operations happening concurrently in normal usage,
then you should increase the max_client_requests parameter to a
value greater than 5 to give yourself more headroom.

The right limit is hard to suggest without knowing more about your mgmt
application. As an example though, if your application can potentially
have 2 API calls pending per running VM and your host capacity allows
for 100 VMs, then you might plan for your max_client_requests value
to be 200. Having a big value for max_client_requests is not inherently
a bad thing - we just want to prevent unbounded growth when things go
wrong.
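
As a rough sketch of that sizing arithmetic in Python (the helper name
and its arguments are hypothetical, and the two inputs are just the
example figures above, not a recommendation for any particular
deployment):

```python
def suggested_max_client_requests(calls_per_vm, max_vms):
    """Worst-case number of API calls pending at once on one connection.

    Hypothetical helper: assumes every VM can have `calls_per_vm`
    calls outstanding at the same time.
    """
    return calls_per_vm * max_vms


limit = suggested_max_client_requests(calls_per_vm=2, max_vms=100)
# Print the value in the form it would take as a config setting.
print(f"max_client_requests = {limit}")  # max_client_requests = 200
```

The printed line is in the form the setting would take in the daemon
configuration file (libvirtd.conf, assuming the monolithic libvirtd
daemon is the one in use).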

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|