> On Nov 9, 2020, at 3:10 PM, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote: > > > > On 11/9/20 8:31 PM, Chuck Lever wrote: >> >> >>> On Nov 9, 2020, at 1:16 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: >>> >>> On Mon, 2020-11-09 at 12:36 -0500, Chuck Lever wrote: >>>> >>>> >>>>> On Nov 9, 2020, at 12:32 PM, Trond Myklebust < >>>>> trondmy@xxxxxxxxxxxxxxx> wrote: >>>>> >>>>> On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote: >>>>>> >>>>>> >>>>>>> On Nov 9, 2020, at 12:08 PM, Trond Myklebust >>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote: >>>>>>> >>>>>>> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote: >>>>>>>> Daire Byrne reports a ~50% aggregrate throughput regression >>>>>>>> on >>>>>>>> his >>>>>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach >>>>>>>> server >>>>>>>> to >>>>>>>> use xprt_sock_sendmsg for socket sends"), which replaced >>>>>>>> kernel_send_page() calls in NFSD's socket send path with >>>>>>>> calls to >>>>>>>> sock_sendmsg() using iov_iter. >>>>>>>> >>>>>>>> Investigation showed that tcp_sendmsg() was not using zero- >>>>>>>> copy >>>>>>>> to >>>>>>>> send the xdr_buf's bvec pages, but instead was relying on >>>>>>>> memcpy. >>>>>>>> >>>>>>>> Set up the socket and each msghdr that bears bvec pages to >>>>>>>> use >>>>>>>> the >>>>>>>> zero-copy mechanism in tcp_sendmsg. >>>>>>>> >>>>>>>> Reported-by: Daire Byrne <daire@xxxxxxxx> >>>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439 >>>>>>>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use >>>>>>>> xprt_sock_sendmsg >>>>>>>> for socket sends") >>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> >>>>>>>> --- >>>>>>>> net/sunrpc/socklib.c | 5 ++++- >>>>>>>> net/sunrpc/svcsock.c | 1 + >>>>>>>> net/sunrpc/xprtsock.c | 1 + >>>>>>>> 3 files changed, 6 insertions(+), 1 deletion(-) >>>>>>>> >>>>>>>> This patch does not fully resolve the issue. Daire reports >>>>>>>> high >>>>>>>> softIRQ activity after the patch is applied, and this >>>>>>>> activity >>>>>>>> seems to prevent full restoration of previous performance. >>>>>>>> >>>>>>>> >>>>>>>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c >>>>>>>> index d52313af82bc..af47596a7bdd 100644 >>>>>>>> --- a/net/sunrpc/socklib.c >>>>>>>> +++ b/net/sunrpc/socklib.c >>>>>>>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct >>>>>>>> socket >>>>>>>> *sock, struct msghdr *msg, >>>>>>>> if (err < 0) >>>>>>>> return err; >>>>>>>> >>>>>>>> + msg->msg_flags |= MSG_ZEROCOPY; >>>>>>>> iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec, >>>>>>>> xdr_buf_pagecount(xdr), >>>>>>>> xdr->page_len + xdr->page_base); >>>>>>>> - return xprt_sendmsg(sock, msg, base + xdr- >>>>>>>>> page_base); >>>>>>>> + err = xprt_sendmsg(sock, msg, base + xdr->page_base); >>>>>>>> + msg->msg_flags &= ~MSG_ZEROCOPY; >>>>>>>> + return err; >>>>>>>> } >>>>>>>> >>>>>>>> /* Common case: >>>>>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c >>>>>>>> index c2752e2b9ce3..c814b4953b15 100644 >>>>>>>> --- a/net/sunrpc/svcsock.c >>>>>>>> +++ b/net/sunrpc/svcsock.c >>>>>>>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct >>>>>>>> svc_sock >>>>>>>> *svsk, >>>>>>>> struct svc_serv *serv) >>>>>>>> svsk->sk_datalen = 0; >>>>>>>> memset(&svsk->sk_pages[0], 0, sizeof(svsk- >>>>>>>>> sk_pages)); >>>>>>>> >>>>>>>> + sock_set_flag(sk, SOCK_ZEROCOPY); >>>>>>>> tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF; >>>>>>>> >>>>>>>> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); >>>>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c >>>>>>>> index 7090bbee0ec5..343c6396b297 100644 >>>>>>>> --- a/net/sunrpc/xprtsock.c >>>>>>>> +++ b/net/sunrpc/xprtsock.c >>>>>>>> @@ -2175,6 +2175,7 @@ static int >>>>>>>> xs_tcp_finish_connecting(struct >>>>>>>> rpc_xprt *xprt, struct socket *sock) >>>>>>>> >>>>>>>> /* socket options */ >>>>>>>> sock_reset_flag(sk, SOCK_LINGER); >>>>>>>> + sock_set_flag(sk, SOCK_ZEROCOPY); >>>>>>>> tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF; >>>>>>>> >>>>>>>> xprt_clear_connected(xprt); >>>>>>>> >>>>>>>> >>>>>>> I'm thinking we are not really allowed to do that here. The >>>>>>> pages >>>>>>> we >>>>>>> pass in to the RPC layer are not guaranteed to contain stable >>>>>>> data >>>>>>> since they include unlocked page cache pages as well as >>>>>>> O_DIRECT >>>>>>> pages. >>>>>> >>>>>> I assume you mean the client side only. Those issues aren't a >>>>>> factor >>>>>> on the server. Not setting SOCK_ZEROCOPY here should be enough to >>>>>> prevent the use of zero-copy on the client. >>>>>> >>>>>> However, the client loses the benefits of sending a page at a >>>>>> time. >>>>>> Is there a desire to remedy that somehow? >>>>> >>>>> What about splice reads on the server side? >>>> >>>> On the server, this path formerly used kernel_sendpages(), which I >>>> assumed is similar to the sendmsg zero-copy mechanism. How does >>>> kernel_sendpages() mitigate against page instability? >>> >>> It copies the data. 🙂 >> >> tcp_sendmsg_locked() invokes skb_copy_to_page_nocache(), which is >> where Daire's performance-robbing memcpy occurs. >> >> do_tcp_sendpages() has no such call site. Therefore the legacy >> sendpage-based path has at least one fewer data copy operations. >> >> What is the appropriate way to make tcp_sendmsg() treat a bvec-bearing >> msghdr like an array of struct page pointers passed to kernel_sendpage() ? >> > > > MSG_ZEROCOPY is only accepted if sock_flag(sk, SOCK_ZEROCOPY) is true, > ie if SO_ZEROCOPY socket option has been set earlier. Eric, are you suggesting that ZEROCOPY is the mechanism that socket consumers should be using with sock_sendmsg to get the same behavior as kernel_sendpage() ? If no, what is the preferred approach? If yes, can you comment on the added soft IRQ workload when NFSD sets these flags? -- Chuck Lever