On Sat, 2021-02-13 at 23:30 +0000, Chuck Lever wrote:
> 
> > On Feb 13, 2021, at 5:10 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Sat, 2021-02-13 at 21:53 +0000, Chuck Lever wrote:
> > > Hi Trond-
> > > 
> > > > On Feb 13, 2021, at 3:25 PM, trondmy@xxxxxxxxxx wrote:
> > > > 
> > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > 
> > > > Use a counter to keep track of how many requests are queued behind
> > > > the xprt->xpt_mutex, and keep TCP_CORK set until the queue is empty.
> > > 
> > > I'm intrigued, but IMO, the patch description needs to explain
> > > why this change should be made. Why abandon Nagle?
> > 
> > This doesn't change the Nagle/TCP_NODELAY settings. It just switches
> > to using the new documented kernel interface.
> > 
> > The only change is to use TCP_CORK so that we don't send out partially
> > filled TCP frames, when we can see that there are other RPC replies
> > that are queued up for transmission.
> > 
> > Note the combination TCP_CORK+TCP_NODELAY is common, and the main
> > effect of the latter is that when we turn off the TCP_CORK, then there
> > is an immediate forced push of the TCP queue.
> 
> The description above suggests the patch is just a
> clean-up, but a forced push has potential to change
> the server's behavior.

Well, yes. That's very much the point. Right now, the TCP_NODELAY/Nagle
setting means that we're doing that forced push at the end of _every_
RPC reply, whether or not there is more stuff that can be queued up in
the socket. The MSG_MORE is the only thing that keeps us from doing the
forced push on every sendpage() call. So the TCP_CORK is there to
_further delay_ that forced push until we think the queue is empty.
IOW: it attempts to optimise the scheduling of that push until we're
actually done pushing more stuff into the socket.

> I'm trying to assess what kind of additional testing
> would be valuable.
> 
> > > If you expect a performance impact, the description should
> > > provide metrics to show it.
> > 
> > I don't have a performance system to measure the improvement
> > accurately.
> 
> Then let's have Daire try it out, if possible.
> 
> > However I am seeing a notable improvement with the equivalent client
> > change. Specifically, xfstests generic/127 shows a ~20% improvement
> > on my VM based test rig.
> > 
> > > (We should have Daire try this change with his multi-client
> > > setup).
> > > 
> > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  include/linux/sunrpc/svcsock.h | 2 ++
> > > >  net/sunrpc/svcsock.c           | 8 +++++++-
> > > >  2 files changed, 9 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
> > > > index b7ac7fe68306..bcc555c7ae9c 100644
> > > > --- a/include/linux/sunrpc/svcsock.h
> > > > +++ b/include/linux/sunrpc/svcsock.h
> > > > @@ -35,6 +35,8 @@ struct svc_sock {
> > > >  	/* Total length of the data (not including fragment headers)
> > > >  	 * received so far in the fragments making up this rpc:
> > > >  	 */
> > > >  	u32			sk_datalen;
> > > > +	/* Number of queued send requests */
> > > > +	atomic_t		sk_sendqlen;
> > > 
> > > Can you take advantage of xpt_mutex: update this field
> > > only in the critical section, and make it a simple
> > > integer type?
> 
> I see why atomic_inc(&sk_sendqlen); has to happen outside
> the mutex.
> However, it seems a little crazy/costly to have
> both of these serialization primitives.
> 
> I haven't found a mutex_unlock() that will report that there
> are no waiters to wake up, though.
> 
> > > >  	struct page *		sk_pages[RPCSVC_MAXPAGES];	/* received data */
> > > >  };
> > > > 
> > > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > > index 5a809c64dc7b..231f510a4830 100644
> > > > --- a/net/sunrpc/svcsock.c
> > > > +++ b/net/sunrpc/svcsock.c
> > > > @@ -1171,18 +1171,23 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > > 
> > > >  	svc_tcp_release_rqst(rqstp);
> > > > 
> > > > +	atomic_inc(&svsk->sk_sendqlen);
> > > >  	mutex_lock(&xprt->xpt_mutex);
> > > >  	if (svc_xprt_is_dead(xprt))
> > > >  		goto out_notconn;
> > > > +	tcp_sock_set_cork(svsk->sk_sk, true);
> > > >  	err = svc_tcp_sendmsg(svsk->sk_sock, &msg, xdr, marker, &sent);
> > > >  	xdr_free_bvec(xdr);
> > > >  	trace_svcsock_tcp_send(xprt, err < 0 ? err : sent);
> > > >  	if (err < 0 || sent != (xdr->len + sizeof(marker)))
> > > >  		goto out_close;
> > > > +	if (atomic_dec_and_test(&svsk->sk_sendqlen))
> > > > +		tcp_sock_set_cork(svsk->sk_sk, false);
> > > >  	mutex_unlock(&xprt->xpt_mutex);
> > > >  	return sent;
> > > > 
> > > >  out_notconn:
> > > > +	atomic_dec(&svsk->sk_sendqlen);
> > > >  	mutex_unlock(&xprt->xpt_mutex);
> > > >  	return -ENOTCONN;
> > > >  out_close:
> > > > @@ -1192,6 +1197,7 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
> > > >  			(err < 0) ? err : sent, xdr->len);
> > > >  	set_bit(XPT_CLOSE, &xprt->xpt_flags);
> > > >  	svc_xprt_enqueue(xprt);
> > > > +	atomic_dec(&svsk->sk_sendqlen);
> > > >  	mutex_unlock(&xprt->xpt_mutex);
> > > >  	return -EAGAIN;
> > > >  }
> > > > @@ -1261,7 +1267,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
> > > >  	svsk->sk_datalen = 0;
> > > >  	memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
> > > > 
> > > > -	tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
> > > > +	tcp_sock_set_nodelay(sk);
> > > > 
> > > >  	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> > > >  	switch (sk->sk_state) {
> > > > -- 
> > > > 2.29.2
> > > 
> > > --
> > > Chuck Lever
> > 
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx
> 
> --
> Chuck Lever

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
www.hammer.space
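
[Editor's note: for readers following the TCP_CORK/TCP_NODELAY discussion
above, here is a minimal userspace sketch of the same cork/uncork pattern.
It is not the patch itself (the patch uses the in-kernel tcp_sock_set_cork()
and tcp_sock_set_nodelay() helpers); the file descriptor, reply buffers, and
the set_tcp_flag()/send_reply_batch() helpers below are hypothetical and
purely illustrative.]

/*
 * Illustrative sketch only: cork the socket while more replies are known
 * to be queued, and uncork once the queue drains. With Nagle disabled
 * (TCP_NODELAY), clearing TCP_CORK forces an immediate push of whatever
 * partial frame remains in the send queue.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

static int set_tcp_flag(int fd, int optname, int on)
{
	return setsockopt(fd, IPPROTO_TCP, optname, &on, sizeof(on));
}

static ssize_t send_reply_batch(int fd, const char *replies[],
				const size_t lens[], size_t nr_queued)
{
	ssize_t total = 0;
	size_t i;

	/* Nagle off: the uncork below pushes the partial frame at once. */
	if (set_tcp_flag(fd, TCP_NODELAY, 1) < 0)
		return -1;

	/* Cork: let partially filled frames accumulate while more is queued. */
	if (set_tcp_flag(fd, TCP_CORK, 1) < 0)
		return -1;

	for (i = 0; i < nr_queued; i++) {
		ssize_t n = write(fd, replies[i], lens[i]);

		if (n < 0)
			return -1;
		total += n;
	}

	/* Queue drained: uncork, forcing an immediate push of what's left. */
	if (set_tcp_flag(fd, TCP_CORK, 0) < 0)
		return -1;

	return total;
}

[The patch applies the same idea inside svc_tcp_sendto(): sk_sendqlen counts
the senders queued behind xpt_mutex, and the socket is only uncorked when
that count drops to zero.]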