> On Feb 14, 2021, at 1:18 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Sun, 2021-02-14 at 17:58 +0000, Chuck Lever wrote:
>> 
>>> On Feb 14, 2021, at 12:41 PM, Trond Myklebust
>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Sun, 2021-02-14 at 17:27 +0000, Chuck Lever wrote:
>>>> 
>>>>> On Feb 14, 2021, at 11:21 AM, Trond Myklebust
>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Sat, 2021-02-13 at 23:30 +0000, Chuck Lever wrote:
>>>>>> 
>>>>>>> On Feb 13, 2021, at 5:10 PM, Trond Myklebust
>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Sat, 2021-02-13 at 21:53 +0000, Chuck Lever wrote:
>>>>>>>> Hi Trond-
>>>>>>>> 
>>>>>>>>> On Feb 13, 2021, at 3:25 PM, trondmy@xxxxxxxxxx wrote:
>>>>>>>>> 
>>>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>>>> 
>>>>>>>>> Use a counter to keep track of how many requests are queued
>>>>>>>>> behind the xprt->xpt_mutex, and keep TCP_CORK set until the
>>>>>>>>> queue is empty.
>>>>>>>> 
>>>>>>>> I'm intrigued, but IMO, the patch description needs to explain
>>>>>>>> why this change should be made. Why abandon Nagle?
>>>>>>> 
>>>>>>> This doesn't change the Nagle/TCP_NODELAY settings. It just
>>>>>>> switches to using the new documented kernel interface.
>>>>>>> 
>>>>>>> The only change is to use TCP_CORK so that we don't send out
>>>>>>> partially filled TCP frames when we can see that there are
>>>>>>> other RPC replies queued up for transmission.
>>>>>>> 
>>>>>>> Note that the combination TCP_CORK+TCP_NODELAY is common, and
>>>>>>> the main effect of the latter is that when we turn off
>>>>>>> TCP_CORK, there is an immediate forced push of the TCP queue.
>>>>>> 
>>>>>> The description above suggests the patch is just a clean-up,
>>>>>> but a forced push has the potential to change the server's
>>>>>> behavior.
>>>>> 
>>>>> Well, yes. That's very much the point.
>>>>> 
>>>>> Right now, the TCP_NODELAY/Nagle setting means that we're doing
>>>>> that forced push at the end of _every_ RPC reply, whether or not
>>>>> there is more stuff that can be queued up in the socket. The
>>>>> MSG_MORE is the only thing that keeps us from doing the forced
>>>>> push on every sendpage() call.
>>>>> So the TCP_CORK is there to _further delay_ that forced push
>>>>> until we think the queue is empty.
>>>> 
>>>> My concern is that waiting for the queue to empty before pushing
>>>> could improve throughput at the cost of increased average
>>>> round-trip latency. That concern is based on experience I've had
>>>> attempting to batch sends in the RDMA transport.
>>>> 
>>>>> IOW: it attempts to optimise the scheduling of that push until
>>>>> we're actually done pushing more stuff into the socket.
>>>> 
>>>> Yep, clear, thanks. It would help a lot if the above were
>>>> included in the patch description.
>>>> 
>>>> And, I presume that the TCP layer will push anyway if it needs to
>>>> reclaim resources to handle more queued sends.
>>>> 
>>>> Let's also consider starvation; i.e., that the server will
>>>> continue queuing replies such that it never uncorks. The logic in
>>>> the patch appears to depend on the client stopping at some point
>>>> to wait for the server to catch up. There probably should be a
>>>> trap door that uncorks after a few requests (say, 8) or after a
>>>> certain number of bytes are pending without a break.
>>> 
>>> So, firstly, the TCP layer will still push every time a frame is
>>> full, so traffic does not stop altogether while TCP_CORK is set.
>> 
>> OK.
>> 
>>> Secondly, TCP will also always push on send errors (e.g. when out
>>> of free socket buffer).
>> 
>> As I presumed. OK.
>> 
>>> Thirdly, it will always push after hitting the 200ms ceiling, as
>>> described in the tcp(7) manpage.
>> 
>> That's the trap door we need to ensure no starvation or deadlock,
>> assuming there are no bugs in the TCP layer's logic.
>> 
>> 200ms seems a long time to wait, though, when average round-trip
>> latencies are usually under 1ms on typical Ethernet. It would be
>> good to know how often sending has to wait this long.
> 
> If it does wait that long, then it would be because the system is
> under such load that scheduling the next task waiting for the
> xpt_mutex is taking more than 200ms. I don't see how this would make
> things any worse.

Sure, but I'm speculating that some kind of metric or tracepoint might
provide useful field diagnostic information. Not necessary for this
particular patch, but someday, perhaps.

>>> IOW: The TCP_CORK feature is not designed to block the socket
>>> forever. It is there to allow the application to hint to the TCP
>>> layer what it needs to do, in exactly the same way that MSG_MORE
>>> does.
>> 
>> As long as it is only a hint, then we're good.
>> 
>> Out of interest, why not use only MSG_MORE, or remove the use of
>> MSG_MORE in favor of only cork? If these are essentially the same
>> mechanism, it seems like we need only one or the other.
> 
> The advantage of TCP_CORK is that you can perform the queue length
> evaluation _after_ the sendmsg/sendpage/writev call. Since all of
> those can block due to memory allocation, socket buffer shortage,
> etc., it is quite likely under many workloads that other RPC calls
> will have time to finish processing and get queued up.
> 
> I agree that we probably could remove MSG_MORE for TCP once TCP_CORK
> is implemented. However, those flags continue to be useful for UDP.

On the server, the UDP path uses xprt_sock_sendmsg(); the TCP path
uses svc_tcp_sendmsg(). They are separate call paths.

Removing MSG_MORE only for TCP should be a straightforward
simplification to svc_tcp_sendmsg().

--
Chuck Lever
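For reference, the cork/uncork pattern discussed in this thread looks
roughly like the sketch below when expressed with the userspace socket
options documented in tcp(7); inside the kernel, the equivalent helper
is tcp_sock_set_cork(). This is only an illustration of the technique,
not the SUNRPC code itself: the replies[] array and the
send_queued_replies() helper are hypothetical stand-ins for the RPC
replies queued up behind xprt->xpt_mutex.

/*
 * Illustrative sketch only: keep the socket corked while more replies
 * are queued, then uncork to force a push once the queue is empty.
 * With TCP_NODELAY set, clearing TCP_CORK flushes the send queue
 * immediately.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

static void sock_set_cork(int sock, int on)
{
	/* While corked, TCP emits only full frames (or after ~200ms). */
	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

static void send_queued_replies(int sock, const char *replies[], int nr)
{
	int one = 1;

	/* Nagle stays off; the push happens when the cork is removed. */
	setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	sock_set_cork(sock, 1);		/* hold back partial frames ...  */
	for (int i = 0; i < nr; i++)
		send(sock, replies[i], strlen(replies[i]), 0);
	sock_set_cork(sock, 0);		/* ... until the queue is empty  */
}

In the patch under discussion, the state corresponding to the loop
above is the counter of requests waiting on xprt->xpt_mutex, and the
cork is cleared only once that counter drops to zero; passing MSG_MORE
on each send would be redundant, which is why it could be dropped from
the TCP path.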