> On Jan 4, 2019, at 5:44 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2019-01-04 at 16:35 -0500, Chuck Lever wrote:
>>> On Jan 3, 2019, at 11:00 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, 2019-01-03 at 17:49 -0500, Chuck Lever wrote:
>>>>> On Jan 3, 2019, at 4:35 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>> 
>>>>>> On Jan 3, 2019, at 4:28 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Thu, 2019-01-03 at 16:07 -0500, Chuck Lever wrote:
>>>>>>>> On Jan 3, 2019, at 3:53 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>>> On Jan 3, 2019, at 1:47 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> On Thu, 2019-01-03 at 13:29 -0500, Chuck Lever wrote:
>>>>>>>>>> +	reclen = req->rq_snd_buf.len;
>>>>>>>>>> +	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | reclen);
>>>>>>>>>> +	return kernel_sendmsg(transport->sock, &msg, &iov, 1, iov.iov_len);
>>>>>>>>> 
>>>>>>>>> So what does this do for performance? I'd expect that adding
>>>>>>>>> another dive into the socket layer will come with penalties.
>>>>>>>> 
>>>>>>>> NFSv3 on TCP, sec=sys, 56Gbs IBoIP, v4.20 + my v4.21 patches
>>>>>>>> fio, 8KB random, 70% read, 30% write, 16 threads, iodepth=16
>>>>>>>> 
>>>>>>>> Without this patch:
>>>>>>>> 
>>>>>>>>   read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51092msec)
>>>>>>>>   write: IOPS=12.3k, BW=96.3MiB/s (101MB/s)(4918MiB/51092msec)
>>>>>>>> 
>>>>>>>> With this patch:
>>>>>>>> 
>>>>>>>>   read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
>>>>>>>>   write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
>>>>>>>> 
>>>>>>>> Seems like that's in the noise.
>>>>>>> 
>>>>>>> Sigh. That's because it was the same kernel. Again, with feeling:
>>>>>>> 
>>>>>>> 4.20.0-rc7-00048-g9274254:
>>>>>>>   read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
>>>>>>>   write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
>>>>>>> 
>>>>>>> 4.20.0-rc7-00049-ga4dea15:
>>>>>>>   read: IOPS=27.2k, BW=212MiB/s (223MB/s)(11.2GiB/53979msec)
>>>>>>>   write: IOPS=11.7k, BW=91.1MiB/s (95.5MB/s)(4917MiB/53979msec)
>>>>>> 
>>>>>> So about a 5% reduction in performance?
>>>>> 
>>>>> On this workload, yes.
>>>>> 
>>>>> Could send the record marker in xs_send_kvec with the head[0] iovec.
>>>>> I'm going to try that next.
>>>> 
>>>> That helps:
>>>> 
>>>> Linux 4.20.0-rc7-00049-g664f679 #651 SMP Thu Jan 3 17:35:26 EST 2019
>>>> 
>>>>   read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51185msec)
>>>>   write: IOPS=12.3k, BW=96.1MiB/s (101MB/s)(4919MiB/51185msec)
>>> 
>>> Interesting... Perhaps we might be able to eke out a few more percent
>>> performance on file writes by also converting xs_send_pagedata() to
>>> use a single sock_sendmsg() w/ iov_iter rather than looping through
>>> several calls to sendpage()?
>> 
>> IMO...
>> 
>> For small requests (say, smaller than 17 pages), packing the head,
>> pagevec, and tail into an iov_iter and sending them all via a single
>> sock_sendmsg call would likely be efficient.
>> 
>> For larger requests, other overheads would dominate. And you'd have to
>> keep around an iter array that held 257 entries... You could pass a
>> large pagevec to sock_sendmsg in smaller chunks.
>> 
>> Are you thinking of converting xs_sendpages (or even xdr_bufs) to use
>> iov_iter directly?
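[Aside: the "record marker in xs_send_kvec with the head[0] iovec" variant
discussed above amounts to roughly the sketch below. The helper name and
MSG_MORE usage are placeholders for illustration, not the exact diff that
was measured.]

#include <linux/net.h>
#include <linux/uio.h>
#include <linux/sunrpc/msg_prot.h>
#include <linux/sunrpc/xdr.h>

/*
 * Illustrative only: fold the 4-byte RPC-over-TCP record marker into
 * the same kernel_sendmsg() call that transmits head[0], so the marker
 * no longer costs a separate trip into the socket layer.
 */
static int xs_send_marker_and_head(struct socket *sock, struct xdr_buf *xdr)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE };
	rpc_fraghdr marker;
	struct kvec iov[2];

	/* Last-fragment bit OR'd with the length of the RPC record. */
	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | xdr->len);

	iov[0].iov_base = &marker;
	iov[0].iov_len  = sizeof(marker);
	iov[1] = xdr->head[0];	/* RPC header plus small arguments */

	/* One sendmsg covers both the marker and head[0]. */
	return kernel_sendmsg(sock, &msg, iov, 2,
			      iov[0].iov_len + iov[1].iov_len);
}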
> 
> For now, I was thinking of just converting xs_sendpages to call
> xdr_alloc_bvec(), and then do the equivalent of what xs_read_bvec()
> does for receives today.
> 
> The next step is to convert xdr_bufs to use bvecs natively instead of
> having to allocate them to shadow the array of pages. I believe someone
> was working on allowing a single bvec to take an array of pages
> (containing contiguous data), which would make that conversion almost
> trivial.
> 
> The final step would be to do as you say, to pack the kvecs into the
> same call to sock_sendmsg() as the bvecs. We might imagine adding a new
> type of iov_iter that can iterate over an array of struct iov_iter in
> order to deal with this case?

The same approach might help svc_send_common, and could be an easier
first step.

--
Chuck Lever
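P.S. Roughly what that first step might look like, whether in xs_sendpages
or svc_send_common. This is an untested sketch: the function name, the gfp
choice, and the 4.20-era iov_iter_bvec() signature are assumptions, and
partial-send handling is omitted.

#include <linux/net.h>
#include <linux/uio.h>
#include <linux/sunrpc/xdr.h>

/*
 * Shadow the xdr_buf's page array with a bvec via xdr_alloc_bvec(),
 * then push the whole page payload through a single sock_sendmsg()
 * with an ITER_BVEC iterator, the way the receive side already does.
 */
static int xs_send_pagedata_bvec(struct socket *sock, struct xdr_buf *xdr,
				 unsigned int base)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE };
	int err;

	/* Populate xdr->bvec[] from xdr->pages[]; no data is copied. */
	err = xdr_alloc_bvec(xdr, GFP_NOFS);
	if (err < 0)
		return err;

	iov_iter_bvec(&msg.msg_iter, WRITE, xdr->bvec,
		      xdr_buf_pagecount(xdr),
		      xdr->page_len + xdr->page_base);

	/* Skip the offset into the first page and any bytes already sent. */
	iov_iter_advance(&msg.msg_iter, xdr->page_base + base);

	return sock_sendmsg(sock, &msg);
}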