On Fri, 2019-01-04 at 16:35 -0500, Chuck Lever wrote:
> > On Jan 3, 2019, at 11:00 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Thu, 2019-01-03 at 17:49 -0500, Chuck Lever wrote:
> > > > On Jan 3, 2019, at 4:35 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > > > 
> > > > > On Jan 3, 2019, at 4:28 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > On Thu, 2019-01-03 at 16:07 -0500, Chuck Lever wrote:
> > > > > > > On Jan 3, 2019, at 3:53 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > > On Jan 3, 2019, at 1:47 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > 
> > > > > > > > On Thu, 2019-01-03 at 13:29 -0500, Chuck Lever wrote:
> > > > > > > > > +	reclen = req->rq_snd_buf.len;
> > > > > > > > > +	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | reclen);
> > > > > > > > > +	return kernel_sendmsg(transport->sock, &msg, &iov, 1, iov.iov_len);
> > > > > > > > 
> > > > > > > > So what does this do for performance? I'd expect that adding another
> > > > > > > > dive into the socket layer will come with penalties.
> > > > > > > 
> > > > > > > NFSv3 on TCP, sec=sys, 56Gb/s IPoIB, v4.20 + my v4.21 patches
> > > > > > > fio, 8KB random, 70% read, 30% write, 16 threads, iodepth=16
> > > > > > > 
> > > > > > > Without this patch:
> > > > > > > 
> > > > > > > read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51092msec)
> > > > > > > write: IOPS=12.3k, BW=96.3MiB/s (101MB/s)(4918MiB/51092msec)
> > > > > > > 
> > > > > > > With this patch:
> > > > > > > 
> > > > > > > read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
> > > > > > > write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
> > > > > > > 
> > > > > > > Seems like that's in the noise.
> > > > > > 
> > > > > > Sigh. That's because it was the same kernel. Again, with feeling:
> > > > > > 
> > > > > > 4.20.0-rc7-00048-g9274254:
> > > > > > read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
> > > > > > write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
> > > > > > 
> > > > > > 4.20.0-rc7-00049-ga4dea15:
> > > > > > read: IOPS=27.2k, BW=212MiB/s (223MB/s)(11.2GiB/53979msec)
> > > > > > write: IOPS=11.7k, BW=91.1MiB/s (95.5MB/s)(4917MiB/53979msec)
> > > > > 
> > > > > So about a 5% reduction in performance?
> > > > 
> > > > On this workload, yes.
> > > > 
> > > > I could send the record marker in xs_send_kvec with the head[0]
> > > > iovec. I'm going to try that next.
> > > 
> > > That helps:
> > > 
> > > Linux 4.20.0-rc7-00049-g664f679 #651 SMP Thu Jan 3 17:35:26 EST 2019
> > > 
> > > read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51185msec)
> > > write: IOPS=12.3k, BW=96.1MiB/s (101MB/s)(4919MiB/51185msec)
> > 
> > Interesting...
> > Perhaps we might be able to eke out a few more percent of
> > performance on file writes by also converting xs_send_pagedata() to
> > use a single sock_sendmsg() w/ iov_iter rather than looping through
> > several calls to sendpage()?
> 
> IMO...
> 
> For small requests (say, smaller than 17 pages), packing the head,
> pagevec, and tail into an iov_iter and sending them all via a single
> sock_sendmsg call would likely be efficient.
> 
> For larger requests, other overheads would dominate. And you'd have to
> keep around an iter array that held 257 entries... You could pass a
> large pagevec to sock_sendmsg in smaller chunks.
> 
> Are you thinking of converting xs_sendpages (or even xdr_bufs) to use
> iov_iter directly?

For now, I was thinking of just converting xs_sendpages to call
xdr_alloc_bvec(), and then doing the equivalent of what xs_read_bvec()
does for receives today.

The next step would be to convert xdr_bufs to use bvecs natively instead
of having to allocate them to shadow the array of pages. I believe
someone was working on allowing a single bvec to take an array of pages
(containing contiguous data), which would make that conversion almost
trivial.

The final step would be to do as you say: pack the kvecs into the same
call to sock_sendmsg() as the bvecs. We might imagine adding a new type
of iov_iter that can iterate over an array of struct iov_iter in order
to deal with this case?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx