Re: [RFC] fix parallelism for rpc tasks

> On Feb 14, 2018, at 6:13 PM, Mora, Jorge <Jorge.Mora@xxxxxxxxxx> wrote:
> 
> Hello,
> 
> The patch gives some performance improvement on Kerberos read.
> The following results show performance comparisons between unpatched
> and patched systems. The html files included as attachments show the
> results as line charts.
> 
> - The best read performance improvement is seen when testing with a single
>  dd transfer: the patched system gives 70% better performance than the
>  unpatched system.
>  (first set of results)
> 
> - The patched system gives 18% better performance than the unpatched system
>  when testing with multiple dd transfers.
>  (second set of results)
> 
> - The write test shows there is no performance hit from the patch.
>  (third set of results)
> 
> - When testing on a different client with less RAM and fewer CPU cores,
>  there is no performance degradation for Kerberos on the unpatched system.
>  In this case, the patch does not provide any performance improvement.
>  (fourth set of results)
> 
> ================================================================================
> Test environment:
> 
> NFS client:  CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz)
> NFS servers: CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz)
> NFS mount:   NFSv3 with sec=(sys or krb5p)
> 
> For tests with a single dd transfer, only one NFS server was used and one
> file was read -- a single transfer was enough to fill up the network
> connection.
> 
> For tests with multiple dd transfers, three different NFS servers were used,
> with four different files per NFS server, for a total of 12 different files
> being read (12 transfers in parallel).
> 
> The patch was applied on top of the 4.14.0-rc3 kernel and the NFS servers
> were running RHEL 7.4.
> 
> The fourth set of results below shows an unpatched system with no Kerberos
> degradation (same 4.14.0-rc3 kernel), but in contrast with the main client
> used for testing, this client has only 4 CPU cores and 8GB of RAM.
> I believe that even though this system has fewer CPU cores and less RAM,
> its CPU is faster (E31220 @ 3.10GHz vs E5620 @ 2.40GHz), so it is able to
> handle the Kerberos load better than the main client and fill up the
> network connection with a single thread.

Jorge, thanks for publishing these results.

Can you do a "numactl -H" on your clients and post the output? I suspect
the throughput improvement on the big client is because WQ_UNBOUND
behaves differently on NUMA systems. (Even so, I agree that the proposed
change is valuable).
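
Not part of Jorge's patch, but a possible testing aid: if the patch gets
respun, adding WQ_SYSFS alongside WQ_UNBOUND should make the queue visible
under /sys/devices/virtual/workqueue/rpciod/ so we can compare its cpumask
and NUMA attributes on the two clients. Roughly (a sketch only, untested):

--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1095,7 +1095,7 @@ static int rpciod_start(void)
 	 * Create the rpciod thread and wait for it to start.
 	 */
 	dprintk("RPC:       creating workqueue rpciod\n");
-	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
+	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_SYSFS, 0);
 	if (!wq)
 		goto out_failed;
 	rpciod_workqueue = wq;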


> ================================================================================
> 
> Kerberos Read Performance: 170.15% (patched system over unpatched system)
> 
> Client CPU:        Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:         16
> RAM:               32 GB
> NFS version:       3
> Mount points:      1
> dd's per mount:    1
> Total dd's:        1
> Data transferred:  7.81 GB (per run)
> Number of runs:    10
> 
> Kerberos Read Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg:  65.88 MB/s,   var:  20.28,   stddev:   4.50
> Transfer rate (patched system)    avg: 112.10 MB/s,   var:   0.00,   stddev:   0.01
> Performance (patched over unpatched):  170.15%
> 
> Unpatched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.96 MB/s,   var:   0.02,   stddev:   0.13
> Transfer rate (sec=krb5p)  avg:  65.88 MB/s,   var:  20.28,   stddev:   4.50
> Performance (krb5p over sys):   58.84%
> 
> Patched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.94 MB/s,   var:   0.02,   stddev:   0.14
> Transfer rate (sec=krb5p)  avg: 112.10 MB/s,   var:   0.00,   stddev:   0.01
> Performance (krb5p over sys):  100.14%
> 
> ================================================================================
> 
> Kerberos Read Performance: 118.02% (patched system over unpatched system)
> 
> Client CPU:        Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:         16
> RAM:               32 GB
> NFS version:       3
> Mount points:      3
> dd's per mount:    4
> Total dd's:        12
> Data transferred:  93.75 GB (per run)
> Number of runs:    10
> 
> Kerberos Read Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg:  94.99 MB/s,   var:  68.96,   stddev:   8.30
> Transfer rate (patched system)    avg: 112.11 MB/s,   var:   0.00,   stddev:   0.03
> Performance (patched over unpatched):  118.02%
> 
> Unpatched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 112.21 MB/s,   var:   0.00,   stddev:   0.00
> Transfer rate (sec=krb5p)  avg:  94.99 MB/s,   var:  68.96,   stddev:   8.30
> Performance (krb5p over sys):   84.66%
> 
> Patched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 112.20 MB/s,   var:   0.00,   stddev:   0.00
> Transfer rate (sec=krb5p)  avg: 112.11 MB/s,   var:   0.00,   stddev:   0.03
> Performance (krb5p over sys):   99.92%
> 
> ================================================================================
> 
> Kerberos Write Performance: 101.55% (patched system over unpatched system)
> 
> Client CPU:        Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:         16
> RAM:               32 GB
> NFS version:       3
> Mount points:      3
> dd's per mount:    4
> Total dd's:        12
> Data transferred:  93.75 GB (per run)
> Number of runs:    10
> 
> Kerberos Write Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg: 103.70 MB/s,   var: 110.51,   stddev:  10.51
> Transfer rate (patched system)    avg: 105.31 MB/s,   var:  35.04,   stddev:   5.92
> Performance (patched over unpatched):  101.55%
> 
> Unpatched System Write Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 109.87 MB/s,   var:  10.27,   stddev:   3.20
> Transfer rate (sec=krb5p)  avg: 103.70 MB/s,   var: 110.51,   stddev:  10.51
> Performance (krb5p over sys):   94.39%
> 
> Patched System Write Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.03 MB/s,   var:   0.58,   stddev:   0.76
> Transfer rate (sec=krb5p)  avg: 105.31 MB/s,   var:  35.04,   stddev:   5.92
> Performance (krb5p over sys):   94.85%
> 
> ================================================================================
> 
> Kerberos Read Performance: 99.99% (patched system over unpatched system)
> 
> Client CPU:        Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
> CPU cores:         4
> RAM:               8 GB
> NFS version:       3
> Mount points:      1
> dd's per mount:    1
> Total dd's:        1
> Data transferred:  7.81 GB (per run)
> Number of runs:    10
> 
> Kerberos Read Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg: 112.02 MB/s,   var:   0.04,   stddev:   0.21
> Transfer rate (patched system)    avg: 112.01 MB/s,   var:   0.06,   stddev:   0.25
> Performance (patched over unpatched):   99.99%
> 
> Unpatched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.86 MB/s,   var:   0.06,   stddev:   0.24
> Transfer rate (sec=krb5p)  avg: 112.02 MB/s,   var:   0.04,   stddev:   0.21
> Performance (krb5p over sys):  100.14%
> 
> Patched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.76 MB/s,   var:   0.12,   stddev:   0.34
> Transfer rate (sec=krb5p)  avg: 112.01 MB/s,   var:   0.06,   stddev:   0.25
> Performance (krb5p over sys):  100.22%
> 
> 
> --Jorge
> 
> ________________________________________
> From: linux-nfs-owner@xxxxxxxxxxxxxxx <linux-nfs-owner@xxxxxxxxxxxxxxx> on behalf of Olga Kornievskaia <aglo@xxxxxxxxx>
> Sent: Wednesday, July 19, 2017 11:59 AM
> To: Trond Myklebust
> Cc: linux-nfs@xxxxxxxxxxxxxxx; chuck.lever@xxxxxxxxxx
> Subject: Re: [RFC] fix parallelism for rpc tasks
> 
> On Wed, Jul 5, 2017 at 1:33 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>> On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust
>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>>>> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>>>>>>> On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <aglo@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>>>>>>>>> Hi folks,
>>>>>>>>> 
>>>>>>>>> On a multi-core machine, is it expected that we can have
>>>>>>>>> parallel RPCs handled by each of the per-core workqueues?
>>>>>>>>> 
>>>>>>>>> In testing a read workload, I observed via the "top" command
>>>>>>>>> that a single "kworker" thread is running, servicing the
>>>>>>>>> requests (no parallelism). It's more prominent while doing
>>>>>>>>> these operations over a krb5p mount.
>>>>>>>>> 
>>>>>>>>> What Bruce suggested is to try this, and in my testing I then
>>>>>>>>> see the read workload spread among all the kworker threads.
>>>>>>>>> 
>>>>>>>>> Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
>>>>>>>>> 
>>>>>>>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>>>>>>>> index 0cc8383..f80e688 100644
>>>>>>>>> --- a/net/sunrpc/sched.c
>>>>>>>>> +++ b/net/sunrpc/sched.c
>>>>>>>>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>>>>>>>> * Create the rpciod thread and wait for it to start.
>>>>>>>>> */
>>>>>>>>> dprintk("RPC:       creating workqueue rpciod\n");
>>>>>>>>> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>>>>>>>>> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM |
>>>>>>>>> WQ_UNBOUND,
>>>>>>>>> 0);
>>>>>>>>> if (!wq)
>>>>>>>>> goto out_failed;
>>>>>>>>> rpciod_workqueue = wq;
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> WQ_UNBOUND turns off concurrency management on the thread pool
>>>>>>>> (see Documentation/core-api/workqueue.rst). It also means we
>>>>>>>> contend for work item queuing/dequeuing locks, since the threads
>>>>>>>> which run the work items are not bound to a CPU.
>>>>>>>> 
>>>>>>>> IOW: This is not a slam-dunk obvious gain.
>>>>>>> 
>>>>>>> I agree, but I think it's worth consideration. I'm waiting to get
>>>>>>> (real) performance numbers showing the improvement (instead of my
>>>>>>> VM setup) to help my case. However, a 90% degradation in read
>>>>>>> performance over krb5p was reported when one CPU is executing all
>>>>>>> the ops.
>>>>>>> 
>>>>>>> Is there a different way to make sure that on a multi-processor
>>>>>>> machine we can take advantage of all available CPUs? Simple
>>>>>>> kernel threads instead of a work queue?
>>>>>> 
>>>>>> There is a trade-off between spreading the work, and ensuring it
>>>>>> is executed on a CPU close to the I/O and application. IMO UNBOUND
>>>>>> is a good way to do that. UNBOUND will attempt to schedule the
>>>>>> work on the preferred CPU, but allow it to be migrated if that
>>>>>> CPU is busy.
>>>>>> 
>>>>>> The advantage of this is that when the client workload is CPU
>>>>>> intensive (say, a software build), RPC client work can be
>>>>>> scheduled and run more quickly, which reduces latency.
>>>>>> 
>>>>> 
>>>>> That should no longer be a huge issue, since queue_work() will now
>>>>> default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
>>>>> but will schedule elsewhere if the local CPU is congested.
>>>> 
>>>> I don't believe NFS uses workqueue_congested() to somehow schedule the
>>>> work elsewhere. Unless the queue is marked UNBOUND, I don't believe
>>>> there is any intention of balancing the CPU load.
>>>> 
>>> 
>>> I shouldn't have to test the queue when scheduling with
>>> WORK_CPU_UNBOUND.
>>> 
>> 
>> The comments in the code say that "if CPU dies" the work will be
>> re-scheduled on another one. I think the code requires the queue to be
>> marked UNBOUND to really be scheduled on a different CPU. That is just
>> my reading of the code, and it matches what is seen with the krb5
>> workload.
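
For what it's worth, the bound vs. unbound behavior is easy to observe
outside of NFS with a throwaway test module (a sketch from memory, not
anything that exists in the tree): queue a trivial work item on a bound
workqueue and on a WQ_UNBOUND one, and log which CPU runs it.

#include <linux/module.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

static void where_am_i(struct work_struct *work)
{
	pr_info("work ran on CPU %d\n", raw_smp_processor_id());
}
static DECLARE_WORK(probe_work, where_am_i);

static int __init wq_probe_init(void)
{
	struct workqueue_struct *bound, *unbound;
	int i;

	bound   = alloc_workqueue("wq_probe_bound", 0, 0);
	unbound = alloc_workqueue("wq_probe_unbound", WQ_UNBOUND, 0);
	if (!bound || !unbound)
		goto out;

	/* Per-cpu (bound) queue: the work should stay on the queueing CPU. */
	for (i = 0; i < 8; i++) {
		queue_work(bound, &probe_work);
		flush_work(&probe_work);
	}

	/* Unbound queue: the scheduler is free to place the worker. */
	for (i = 0; i < 8; i++) {
		queue_work(unbound, &probe_work);
		flush_work(&probe_work);
	}
out:
	if (bound)
		destroy_workqueue(bound);
	if (unbound)
		destroy_workqueue(unbound);
	return 0;
}
module_init(wq_probe_init);

static void __exit wq_probe_exit(void)
{
}
module_exit(wq_probe_exit);

MODULE_LICENSE("GPL");
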
> 
> Trond, what's the path forward here? What about a run-time
> configuration that starts rpciod with the UNBOUND option instead?
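
On the run-time configuration point: one way to prototype a switch without
rebuilding would be a sunrpc module parameter consulted in rpciod_start().
Sketch only -- "rpciod_unbound" is a made-up name, not an existing knob,
and this is untested:

/*
 * Hypothetical module parameter: pass rpciod_unbound=1 when loading
 * sunrpc (or sunrpc.rpciod_unbound=1 on the kernel command line) to
 * allocate rpciod as an unbound workqueue.
 */
static bool rpciod_unbound;
module_param(rpciod_unbound, bool, 0444);
MODULE_PARM_DESC(rpciod_unbound, "allocate rpciod as a WQ_UNBOUND workqueue");

/* ...and in rpciod_start(), replacing the current allocation: */
	dprintk("RPC:       creating workqueue rpciod\n");
	wq = alloc_workqueue("rpciod",
			WQ_MEM_RECLAIM | (rpciod_unbound ? WQ_UNBOUND : 0), 0);
	if (!wq)
		goto out_failed;
	rpciod_workqueue = wq;
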
> <dd_read_single.html><dd_read_mult.html><dd_write_mult.html><dd_read_single1.html>

--
Chuck Lever


