> On Feb 14, 2021, at 1:18 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Sun, 2021-02-14 at 17:58 +0000, Chuck Lever wrote:
>> 
>>> On Feb 14, 2021, at 12:41 PM, Trond Myklebust
>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Sun, 2021-02-14 at 17:27 +0000, Chuck Lever wrote:
>>>> 
>>>>> On Feb 14, 2021, at 11:21 AM, Trond Myklebust
>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> On Sat, 2021-02-13 at 23:30 +0000, Chuck Lever wrote:
>>>>>> 
>>>>>>> On Feb 13, 2021, at 5:10 PM, Trond Myklebust
>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> On Sat, 2021-02-13 at 21:53 +0000, Chuck Lever wrote:
>>>>>>>> Hi Trond-
>>>>>>>> 
>>>>>>>>> On Feb 13, 2021, at 3:25 PM, trondmy@xxxxxxxxxx wrote:
>>>>>>>>> 
>>>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>>>> 
>>>>>>>>> Use a counter to keep track of how many requests are queued
>>>>>>>>> behind the xprt->xpt_mutex, and keep TCP_CORK set until the
>>>>>>>>> queue is empty.
>>>>>>>> 
>>>>>>>> I'm intrigued, but IMO, the patch description needs to explain
>>>>>>>> why this change should be made. Why abandon Nagle?
>>>>>>> 
>>>>>>> This doesn't change the Nagle/TCP_NODELAY settings. It just
>>>>>>> switches to using the new documented kernel interface.
>>>>>>> 
>>>>>>> The only change is to use TCP_CORK so that we don't send out
>>>>>>> partially filled TCP frames when we can see that there are
>>>>>>> other RPC replies queued up for transmission.
>>>>>>> 
>>>>>>> Note that the combination TCP_CORK+TCP_NODELAY is common, and
>>>>>>> the main effect of the latter is that when we turn off
>>>>>>> TCP_CORK, there is an immediate forced push of the TCP queue.
>>>>>> 
>>>>>> The description above suggests the patch is just a clean-up,
>>>>>> but a forced push has the potential to change the server's
>>>>>> behavior.
>>>>> 
>>>>> Well, yes. That's very much the point.
>>>>> 
>>>>> Right now, the TCP_NODELAY/Nagle setting means that we're doing
>>>>> that forced push at the end of _every_ RPC reply, whether or not
>>>>> there is more stuff that can be queued up in the socket. The
>>>>> MSG_MORE is the only thing that keeps us from doing the forced
>>>>> push on every sendpage() call.
>>>>> So the TCP_CORK is there to _further delay_ that forced push
>>>>> until we think the queue is empty.
>>>> 
>>>> My concern is that waiting for the queue to empty before pushing
>>>> could improve throughput at the cost of increased average
>>>> round-trip latency. That concern is based on experience I've had
>>>> attempting to batch sends in the RDMA transport.
>>>> 
>>>>> IOW: it attempts to optimise the scheduling of that push until
>>>>> we're actually done pushing more stuff into the socket.
>>>> 
>>>> Yep, clear, thanks. It would help a lot if the above were
>>>> included in the patch description.
>>>> 
>>>> And, I presume that the TCP layer will push anyway if it needs to
>>>> reclaim resources to handle more queued sends.
>>>> 
>>>> Let's also consider starvation; i.e., that the server will
>>>> continue queuing replies such that it never uncorks. The logic in
>>>> the patch appears to depend on the client stopping at some point
>>>> to wait for the server to catch up. There probably should be a
>>>> trap door that uncorks after a few requests (say, 8) or after a
>>>> certain number of bytes are pending without a break.
>>> 
>>> So, firstly, the TCP layer will still push every time a frame is
>>> full, so traffic does not stop altogether while TCP_CORK is set.
>> 
>> OK.
>> 
>>> Secondly, TCP will also always push on send errors (e.g. when out
>>> of free socket buffer).
>> 
>> As I presumed. OK.
>> 
>>> Thirdly, it will always push after hitting the 200ms ceiling, as
>>> described in the tcp(7) manpage.
>> 
>> That's the trap door we need to ensure no starvation or deadlock,
>> assuming there are no bugs in the TCP layer's logic.
>> 
>> 200ms seems a long time to wait, though, when average round-trip
>> latencies are usually under 1ms on typical Ethernet. It would be
>> good to know how often sending has to wait this long.
> 
> If it does wait that long, then it would be because the system is
> under such load that scheduling the next task waiting for the
> xpt_mutex is taking more than 200ms. I don't see how this would make
> things any worse.

Sure, but I'm speculating that some kind of metric or tracepoint might
provide useful field diagnostic information. Not necessary for this
particular patch, but someday, perhaps.

>>> IOW: The TCP_CORK feature is not designed to block the socket
>>> forever. It is there to allow the application to hint to the TCP
>>> layer what it needs to do, in exactly the same way that MSG_MORE
>>> does.
>> 
>> As long as it is only a hint, then we're good.
>> 
>> Out of interest, why not use only MSG_MORE, or remove the use of
>> MSG_MORE in favor of only cork? If these are essentially the same
>> mechanism, it seems like we need only one or the other.
> 
> The advantage of TCP_CORK is that you can perform the queue length
> evaluation _after_ the sendmsg/sendpage/writev call. Since all of
> those can block due to memory allocation, socket buffer shortage,
> etc., it is quite likely under many workloads that other RPC calls
> will have time to finish processing and get queued up.
> 
> I agree that we probably could remove MSG_MORE for TCP once TCP_CORK
> is implemented. However, those flags continue to be useful for UDP.

On the server, the UDP path uses xprt_sock_sendmsg(); the TCP path
uses svc_tcp_sendmsg(). They are separate call paths.

Removing MSG_MORE only for TCP should be a straightforward
simplification to svc_tcp_sendmsg().

--
Chuck Lever
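For reference, the cork/uncork pattern discussed in this thread looks
roughly like the sketch below when expressed with the userspace socket
options documented in tcp(7); inside the kernel, the equivalent helper
is tcp_sock_set_cork(). This is only an illustration of the technique,
not the SUNRPC code itself: the replies[] array and the
send_queued_replies() helper are hypothetical stand-ins for the RPC
replies queued up behind xprt->xpt_mutex.

/*
 * Illustrative sketch only: keep the socket corked while more replies
 * are queued, then uncork to force a push once the queue is empty.
 * With TCP_NODELAY set, clearing TCP_CORK flushes the send queue
 * immediately.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

static void sock_set_cork(int sock, int on)
{
	/* While corked, TCP emits only full frames (or after ~200ms). */
	setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

static void send_queued_replies(int sock, const char *replies[], int nr)
{
	int one = 1;

	/* Nagle stays off; the push happens when the cork is removed. */
	setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

	sock_set_cork(sock, 1);		/* hold back partial frames ...  */
	for (int i = 0; i < nr; i++)
		send(sock, replies[i], strlen(replies[i]), 0);
	sock_set_cork(sock, 0);		/* ... until the queue is empty  */
}

In the patch under discussion, the state corresponding to the loop
above is the counter of requests waiting on xprt->xpt_mutex, and the
cork is cleared only once that counter drops to zero; passing MSG_MORE
on each send would be redundant, which is why it could be dropped from
the TCP path.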