Following up on this. Now with even more data!


> On May 6, 2019, at 4:41 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>> On May 3, 2019, at 7:18 AM, Trond Myklebust <trondmy@xxxxxxxxx> wrote:
>>
>> Allow more time for softirqd
>
> Have you thought about performance tests for this one?

I tested this series on my 12-core two-socket client using a variety
of tests including iozone, fio, and fstests. The network under test
is 56Gb InfiniBand (TCP uses IPoIB). I tested both TCP and RDMA.

With lock debugging and memory leak testing enabled, I did not see
any functional regressions, new leaks, or crashes. Thus IMO this
series is "safe to apply."

With TCP, I saw no change in performance between a "stock" kernel and
one with all five patches in this series applied, as, IIRC, you
predicted.

The following discussion is based on testing with NFS/RDMA.

With RDMA, I saw a 5-10% improvement in IOPS rate between the "stock"
kernel and a kernel with the first four patches applied. When the
fifth patch is applied, IOPS throughput is significantly worse than
"stock" -- about 20% worse.

I also studied average RPC execution time (the "execute" metric) with
the "stock" kernel, the kernel with four patches applied, and the
kernel with all five applied. The workload is 100% 4KB READs with an
iodepth of 1024 in order to saturate the transmit queue.

With four patches, the execute time is about 2.5 msec faster (average
execution time is around 75 msec due to the large backlog this test
generates). With five patches, it's slower than "stock" by 12 msec.

I also saw a 30 usec improvement in the average latency of
xprt_complete_rqst with the four-patch series.

As far as I can tell, the benefit of this series comes mostly from
the third patch, which changes spin_lock_bh(&xprt->transport_lock) to
spin_lock(&xprt->transport_lock). When the xprtiod work queue is
lowered in priority in 5/5, that benefit vanishes.

I am still confused about why 5/5 is needed. I did not see any soft
lockups without this patch applied when using RDMA. Is the issue with
xprtsock's use of xprtiod for handling incoming TCP receives?

I still have some things I'd like to look at. One thing I haven't yet
tried is lock_stat, which would confirm or refute my theory that this
is all about the transport_lock.


>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>> ---
>> net/sunrpc/sched.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> index c7e81336620c..6b37c9a4b48f 100644
>> --- a/net/sunrpc/sched.c
>> +++ b/net/sunrpc/sched.c
>> @@ -1253,7 +1253,7 @@ static int rpciod_start(void)
>>  		goto out_failed;
>>  	rpciod_workqueue = wq;
>>  	/* Note: highpri because network receive is latency sensitive */
>
> The above comment should be deleted as well.
>
>> -	wq = alloc_workqueue("xprtiod", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_HIGHPRI, 0);
>> +	wq = alloc_workqueue("xprtiod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>>  	if (!wq)
>>  		goto free_rpciod;
>>  	xprtiod_workqueue = wq;
>
> --
> Chuck Lever


--
Chuck Lever
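
P.S. For readers not following the series closely, here is a minimal
sketch of the locking pattern I'm referring to in 3/5. This is not
the actual patch: the function names are made up, and the sketch
assumes (as I understand the series intends) that after 3/5 the
transport_lock is never taken from bh context, which is what makes
the plain spinlock safe.

    #include <linux/spinlock.h>
    #include <linux/sunrpc/xprt.h>  /* struct rpc_xprt */

    /* Before 3/5: the lock must be bh-safe, so every process-context
     * acquisition disables softirqs for the whole critical section.
     */
    static void example_before(struct rpc_xprt *xprt)
    {
            spin_lock_bh(&xprt->transport_lock);
            /* ... update transport state ... */
            spin_unlock_bh(&xprt->transport_lock);
    }

    /* After 3/5: a plain spinlock. Softirqs stay enabled while the
     * lock is held, saving a local_bh_disable()/local_bh_enable()
     * pair on every acquisition.
     */
    static void example_after(struct rpc_xprt *xprt)
    {
            spin_lock(&xprt->transport_lock);
            /* ... update transport state ... */
            spin_unlock(&xprt->transport_lock);
    }

If my transport_lock theory is right, a kernel built with
CONFIG_LOCK_STAT=y should show the difference in contention and hold
times for that lock in /proc/lock_stat.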