> On Jan 4, 2019, at 5:44 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2019-01-04 at 16:35 -0500, Chuck Lever wrote:
>>> On Jan 3, 2019, at 11:00 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, 2019-01-03 at 17:49 -0500, Chuck Lever wrote:
>>>>> On Jan 3, 2019, at 4:35 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>> 
>>>>>> On Jan 3, 2019, at 4:28 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>> 
>>>>>> On Thu, 2019-01-03 at 16:07 -0500, Chuck Lever wrote:
>>>>>>>> On Jan 3, 2019, at 3:53 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>>> On Jan 3, 2019, at 1:47 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>> 
>>>>>>>>> On Thu, 2019-01-03 at 13:29 -0500, Chuck Lever wrote:
>>>>>>>>>> +	reclen = req->rq_snd_buf.len;
>>>>>>>>>> +	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | reclen);
>>>>>>>>>> +	return kernel_sendmsg(transport->sock, &msg, &iov, 1, iov.iov_len);
>>>>>>>>> 
>>>>>>>>> So what does this do for performance? I'd expect that adding
>>>>>>>>> another dive into the socket layer will come with penalties.
>>>>>>>> 
>>>>>>>> NFSv3 on TCP, sec=sys, 56Gbs IBoIP, v4.20 + my v4.21 patches
>>>>>>>> fio, 8KB random, 70% read, 30% write, 16 threads, iodepth=16
>>>>>>>> 
>>>>>>>> Without this patch:
>>>>>>>> 
>>>>>>>>   read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51092msec)
>>>>>>>>   write: IOPS=12.3k, BW=96.3MiB/s (101MB/s)(4918MiB/51092msec)
>>>>>>>> 
>>>>>>>> With this patch:
>>>>>>>> 
>>>>>>>>   read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
>>>>>>>>   write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
>>>>>>>> 
>>>>>>>> Seems like that's in the noise.
>>>>>>> 
>>>>>>> Sigh. That's because it was the same kernel. Again, with feeling:
>>>>>>> 
>>>>>>> 4.20.0-rc7-00048-g9274254:
>>>>>>>   read: IOPS=28.6k, BW=224MiB/s (235MB/s)(11.2GiB/51276msec)
>>>>>>>   write: IOPS=12.3k, BW=95.8MiB/s (100MB/s)(4914MiB/51276msec)
>>>>>>> 
>>>>>>> 4.20.0-rc7-00049-ga4dea15:
>>>>>>>   read: IOPS=27.2k, BW=212MiB/s (223MB/s)(11.2GiB/53979msec)
>>>>>>>   write: IOPS=11.7k, BW=91.1MiB/s (95.5MB/s)(4917MiB/53979msec)
>>>>>> 
>>>>>> So about a 5% reduction in performance?
>>>>> 
>>>>> On this workload, yes.
>>>>> 
>>>>> Could send the record marker in xs_send_kvec with the head[0] iovec.
>>>>> I'm going to try that next.
>>>> 
>>>> That helps:
>>>> 
>>>> Linux 4.20.0-rc7-00049-g664f679 #651 SMP Thu Jan 3 17:35:26 EST 2019
>>>> 
>>>>   read: IOPS=28.7k, BW=224MiB/s (235MB/s)(11.2GiB/51185msec)
>>>>   write: IOPS=12.3k, BW=96.1MiB/s (101MB/s)(4919MiB/51185msec)
>>> 
>>> Interesting... Perhaps we might be able to eke out a few more percent
>>> performance on file writes by also converting xs_send_pagedata() to
>>> use a single sock_sendmsg() w/ iov_iter rather than looping through
>>> several calls to sendpage()?
>> 
>> IMO...
>> 
>> For small requests (say, smaller than 17 pages), packing the head,
>> pagevec, and tail into an iov_iter and sending them all via a single
>> sock_sendmsg call would likely be efficient.
>> 
>> For larger requests, other overheads would dominate. And you'd have to
>> keep around an iter array that held 257 entries... You could pass a
>> large pagevec to sock_sendmsg in smaller chunks.
>> 
>> Are you thinking of converting xs_sendpages (or even xdr_bufs) to use
>> iov_iter directly?
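[Aside: the "record marker in xs_send_kvec with the head[0] iovec" variant
discussed above amounts to roughly the sketch below. The helper name and
MSG_MORE usage are placeholders for illustration, not the exact diff that
was measured.]

#include <linux/net.h>
#include <linux/uio.h>
#include <linux/sunrpc/msg_prot.h>
#include <linux/sunrpc/xdr.h>

/*
 * Illustrative only: fold the 4-byte RPC-over-TCP record marker into
 * the same kernel_sendmsg() call that transmits head[0], so the marker
 * no longer costs a separate trip into the socket layer.
 */
static int xs_send_marker_and_head(struct socket *sock, struct xdr_buf *xdr)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE };
	rpc_fraghdr marker;
	struct kvec iov[2];

	/* Last-fragment bit OR'd with the length of the RPC record. */
	marker = cpu_to_be32(RPC_LAST_STREAM_FRAGMENT | xdr->len);

	iov[0].iov_base = &marker;
	iov[0].iov_len  = sizeof(marker);
	iov[1] = xdr->head[0];	/* RPC header plus small arguments */

	/* One sendmsg covers both the marker and head[0]. */
	return kernel_sendmsg(sock, &msg, iov, 2,
			      iov[0].iov_len + iov[1].iov_len);
}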
> 
> For now, I was thinking of just converting xs_sendpages to call
> xdr_alloc_bvec(), and then do the equivalent of what xs_read_bvec()
> does for receives today.
> 
> The next step is to convert xdr_bufs to use bvecs natively instead of
> having to allocate them to shadow the array of pages. I believe someone
> was working on allowing a single bvec to take an array of pages
> (containing contiguous data), which would make that conversion almost
> trivial.
> 
> The final step would be to do as you say, to pack the kvecs into the
> same call to sock_sendmsg() as the bvecs. We might imagine adding a new
> type of iov_iter that can iterate over an array of struct iov_iter in
> order to deal with this case?

The same approach might help svc_send_common, and could be an easier
first step.

--
Chuck Lever
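P.S. Roughly what that first step might look like, whether in xs_sendpages
or svc_send_common. This is an untested sketch: the function name, the gfp
choice, and the 4.20-era iov_iter_bvec() signature are assumptions, and
partial-send handling is omitted.

#include <linux/net.h>
#include <linux/uio.h>
#include <linux/sunrpc/xdr.h>

/*
 * Shadow the xdr_buf's page array with a bvec via xdr_alloc_bvec(),
 * then push the whole page payload through a single sock_sendmsg()
 * with an ITER_BVEC iterator, the way the receive side already does.
 */
static int xs_send_pagedata_bvec(struct socket *sock, struct xdr_buf *xdr,
				 unsigned int base)
{
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_MORE };
	int err;

	/* Populate xdr->bvec[] from xdr->pages[]; no data is copied. */
	err = xdr_alloc_bvec(xdr, GFP_NOFS);
	if (err < 0)
		return err;

	iov_iter_bvec(&msg.msg_iter, WRITE, xdr->bvec,
		      xdr_buf_pagecount(xdr),
		      xdr->page_len + xdr->page_base);

	/* Skip the offset into the first page and any bytes already sent. */
	iov_iter_advance(&msg.msg_iter, xdr->page_base + base);

	return sock_sendmsg(sock, &msg);
}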