> On Feb 14, 2018, at 6:13 PM, Mora, Jorge <Jorge.Mora@xxxxxxxxxx> wrote:
>
> Hello,
>
> The patch gives some performance improvement on Kerberos read.
> The following results show performance comparisons between unpatched
> and patched systems. The html files included as attachments show the
> results as line charts.
>
> - Best read performance improvement when testing with a single dd transfer.
>   The patched system gives 70% better performance than the unpatched system.
>   (first set of results)
>
> - The patched system gives 18% better performance than the unpatched system
>   when testing with multiple dd transfers.
>   (second set of results)
>
> - The write test shows there is no performance hit from the patch.
>   (third set of results)
>
> - When testing on a different client with less RAM and fewer CPU cores,
>   there is no Kerberos performance degradation on the unpatched system.
>   In this case, the patch does not provide any performance improvement.
>   (fourth set of results)
>
> ================================================================================
> Test environment:
>
> NFS client:  CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz)
> NFS servers: CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz)
> NFS mount:   NFSv3 with sec=(sys or krb5p)
>
> For the single-dd tests, one NFS server was used and one file was read --
> a single transfer was enough to fill up the network connection.
>
> For the multiple-dd tests, three different NFS servers were used, with four
> different files read per NFS server, for a total of 12 different files being
> read (12 transfers in parallel).
>
> The patch was applied on top of the 4.14.0-rc3 kernel and the NFS servers
> were running RHEL 7.4.
>
> The fourth set of results below shows an unpatched system with no Kerberos
> degradation (same 4.14.0-rc3 kernel), but in contrast with the main client
> used for testing, this client has only 4 CPU cores and 8GB of RAM.
> I believe that even though this system has fewer CPU cores and less RAM,
> its CPU is faster (E31220 @ 3.10GHz vs E5620 @ 2.40GHz), so it is able to
> handle the Kerberos load better and fill up the network connection with a
> single thread, unlike the main client with more CPU cores and more memory.

Jorge, thanks for publishing these results.

Can you do a "numactl -H" on your clients and post the output?

I suspect the throughput improvement on the big client is because
WQ_UNBOUND behaves differently on NUMA systems. (Even so, I agree
that the proposed change is valuable).
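For anyone following along who has not looked at the patch yet, the change
under discussion simply adds WQ_UNBOUND when rpciod's workqueue is created.
Below is a minimal, self-contained sketch of the two variants -- a toy module
for illustration only, not the sunrpc code itself; all names are made up:

/*
 * Toy module contrasting a per-cpu (concurrency-managed) workqueue with
 * the WQ_UNBOUND variant the proposed patch switches rpciod to.
 */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;

static void demo_fn(struct work_struct *work)
{
	/*
	 * With a per-cpu queue this runs on the CPU that called
	 * queue_work().  With WQ_UNBOUND the scheduler may place the
	 * kworker on any allowed CPU (per-node pools on NUMA machines),
	 * which is what spreads the krb5p decryption work across cores.
	 */
	pr_info("demo work ran on CPU %d\n", raw_smp_processor_id());
}

static DECLARE_WORK(demo_work, demo_fn);

static int __init demo_init(void)
{
	/* Per-cpu variant (current rpciod behaviour): */
	/* demo_wq = alloc_workqueue("wq_demo", WQ_MEM_RECLAIM, 0); */

	/* Unbound variant (what the proposed patch switches to): */
	demo_wq = alloc_workqueue("wq_demo", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
	if (!demo_wq)
		return -ENOMEM;

	queue_work(demo_wq, &demo_work);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Loading the toy module on an idle versus a busy multi-core client and watching
which CPUs the messages come from is a cheap way to see the scheduling
difference the numbers below are measuring.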
> ================================================================================
>
> Kerberos Read Performance: 170.15% (patched system over unpatched system)
>
> Client CPU:        Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:         16
> RAM:               32 GB
> NFS version:       3
> Mount points:      1
> dd's per mount:    1
> Total dd's:        1
> Data transferred:  7.81 GB (per run)
> Number of runs:    10
>
> Kerberos Read Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg: 65.88 MB/s,  var: 20.28, stddev: 4.50
> Transfer rate (patched system)    avg: 112.10 MB/s, var: 0.00,  stddev: 0.01
> Performance (patched over unpatched): 170.15%
>
> Unpatched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.96 MB/s, var: 0.02,  stddev: 0.13
> Transfer rate (sec=krb5p)  avg: 65.88 MB/s,  var: 20.28, stddev: 4.50
> Performance (krb5p over sys): 58.84%
>
> Patched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.94 MB/s, var: 0.02, stddev: 0.14
> Transfer rate (sec=krb5p)  avg: 112.10 MB/s, var: 0.00, stddev: 0.01
> Performance (krb5p over sys): 100.14%
>
> ================================================================================
>
> Kerberos Read Performance: 118.02% (patched system over unpatched system)
>
> Client CPU:        Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:         16
> RAM:               32 GB
> NFS version:       3
> Mount points:      3
> dd's per mount:    4
> Total dd's:        12
> Data transferred:  93.75 GB (per run)
> Number of runs:    10
>
> Kerberos Read Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg: 94.99 MB/s,  var: 68.96, stddev: 8.30
> Transfer rate (patched system)    avg: 112.11 MB/s, var: 0.00,  stddev: 0.03
> Performance (patched over unpatched): 118.02%
>
> Unpatched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 112.21 MB/s, var: 0.00,  stddev: 0.00
> Transfer rate (sec=krb5p)  avg: 94.99 MB/s,  var: 68.96, stddev: 8.30
> Performance (krb5p over sys): 84.66%
>
> Patched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 112.20 MB/s, var: 0.00, stddev: 0.00
> Transfer rate (sec=krb5p)  avg: 112.11 MB/s, var: 0.00, stddev: 0.03
> Performance (krb5p over sys): 99.92%
>
> ================================================================================
>
> Kerberos Write Performance: 101.55% (patched system over unpatched system)
>
> Client CPU:        Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores:         16
> RAM:               32 GB
> NFS version:       3
> Mount points:      3
> dd's per mount:    4
> Total dd's:        12
> Data transferred:  93.75 GB (per run)
> Number of runs:    10
>
> Kerberos Write Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg: 103.70 MB/s, var: 110.51, stddev: 10.51
> Transfer rate (patched system)    avg: 105.31 MB/s, var: 35.04,  stddev: 5.92
> Performance (patched over unpatched): 101.55%
>
> Unpatched System Write Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 109.87 MB/s, var: 10.27,  stddev: 3.20
> Transfer rate (sec=krb5p)  avg: 103.70 MB/s, var: 110.51, stddev: 10.51
> Performance (krb5p over sys): 94.39%
>
> Patched System Write Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.03 MB/s, var: 0.58,  stddev: 0.76
> Transfer rate (sec=krb5p)  avg: 105.31 MB/s, var: 35.04, stddev: 5.92
> Performance (krb5p over sys): 94.85%
>
> ================================================================================
>
> Kerberos Read Performance: 99.99% (patched system over unpatched system)
>
> Client CPU:        Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
> CPU cores:         4
> RAM:               8 GB
> NFS version:       3
> Mount points:      1
> dd's per mount:    1
> Total dd's:        1
> Data transferred:  7.81 GB (per run)
> Number of runs:    10
>
> Kerberos Read Performance (unpatched system vs patched system)
> Transfer rate (unpatched system)  avg: 112.02 MB/s, var: 0.04, stddev: 0.21
> Transfer rate (patched system)    avg: 112.01 MB/s, var: 0.06, stddev: 0.25
> Performance (patched over unpatched): 99.99%
>
> Unpatched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.86 MB/s, var: 0.06, stddev: 0.24
> Transfer rate (sec=krb5p)  avg: 112.02 MB/s, var: 0.04, stddev: 0.21
> Performance (krb5p over sys): 100.14%
>
> Patched System Read Performance (sys vs krb5p)
> Transfer rate (sec=sys)    avg: 111.76 MB/s, var: 0.12, stddev: 0.34
> Transfer rate (sec=krb5p)  avg: 112.01 MB/s, var: 0.06, stddev: 0.25
> Performance (krb5p over sys): 100.22%
>
>
> --Jorge
>
> ________________________________________
> From: linux-nfs-owner@xxxxxxxxxxxxxxx <linux-nfs-owner@xxxxxxxxxxxxxxx> on behalf of Olga Kornievskaia <aglo@xxxxxxxxx>
> Sent: Wednesday, July 19, 2017 11:59 AM
> To: Trond Myklebust
> Cc: linux-nfs@xxxxxxxxxxxxxxx; chuck.lever@xxxxxxxxxx
> Subject: Re: [RFC] fix parallelism for rpc tasks
>
> On Wed, Jul 5, 2017 at 1:33 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>> On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust
>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>>>> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>> On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>>>>>>> On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <aglo@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>>>>>>> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>>>>>>>>> Hi folks,
>>>>>>>>>
>>>>>>>>> On a multi-core machine, is it expected that we can have
>>>>>>>>> parallel RPCs handled by each of the per-core workqueues?
>>>>>>>>>
>>>>>>>>> In testing a read workload, I observe via the "top" command
>>>>>>>>> that a single "kworker" thread is running, servicing the
>>>>>>>>> requests (no parallelism). It's more prominent while doing
>>>>>>>>> these operations over a krb5p mount.
>>>>>>>>>
>>>>>>>>> What Bruce suggested is to try this, and in my testing I then
>>>>>>>>> see the read workload spread among all the kworker threads.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx>
>>>>>>>>>
>>>>>>>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>>>>>>>> index 0cc8383..f80e688 100644
>>>>>>>>> --- a/net/sunrpc/sched.c
>>>>>>>>> +++ b/net/sunrpc/sched.c
>>>>>>>>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>>>>>>>>          * Create the rpciod thread and wait for it to start.
>>>>>>>>>          */
>>>>>>>>>         dprintk("RPC:       creating workqueue rpciod\n");
>>>>>>>>> -       wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>>>>>>>>> +       wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>>>>>>>>>         if (!wq)
>>>>>>>>>                 goto out_failed;
>>>>>>>>>         rpciod_workqueue = wq;
>>>>>>>>>
>>>>>>>>
>>>>>>>> WQ_UNBOUND turns off concurrency management on the thread pool
>>>>>>>> (see Documentation/core-api/workqueue.rst). It also means we
>>>>>>>> contend for work item queuing/dequeuing locks, since the
>>>>>>>> threads which run the work items are not bound to a CPU.
>>>>>>>>
>>>>>>>> IOW: This is not a slam-dunk obvious gain.
>>>>>>>
>>>>>>> I agree, but I think it's worth consideration. I'm waiting to get
>>>>>>> (real) performance numbers for the improvement (instead of my VM
>>>>>>> setup) to help my case. However, a 90% degradation in read
>>>>>>> performance over krb5p was reported when one CPU is executing all
>>>>>>> ops.
>>>>>>>
>>>>>>> Is there a different way to make sure that on a multi-processor
>>>>>>> machine we can take advantage of all available CPUs? Simple
>>>>>>> kernel threads instead of a work queue?
>>>>>>
>>>>>> There is a trade-off between spreading the work and ensuring it
>>>>>> is executed on a CPU close to the I/O and the application. IMO
>>>>>> UNBOUND is a good way to do that. UNBOUND will attempt to schedule
>>>>>> the work on the preferred CPU, but allow it to be migrated if that
>>>>>> CPU is busy.
>>>>>>
>>>>>> The advantage of this is that when the client workload is CPU
>>>>>> intensive (say, a software build), RPC client work can be
>>>>>> scheduled and run more quickly, which reduces latency.
>>>>>>
>>>>>
>>>>> That should no longer be a huge issue, since queue_work() will now
>>>>> default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
>>>>> but will schedule elsewhere if the local CPU is congested.
>>>>
>>>> I don't believe NFS uses workqueue_congested() to somehow schedule
>>>> the work elsewhere. Unless the queue is marked UNBOUND, I don't
>>>> believe there is any intention of balancing the CPU load.
>>>>
>>>
>>> I shouldn't have to test the queue when scheduling with
>>> WORK_CPU_UNBOUND.
>>>
>>
>> Comments in the code say that "if the CPU dies" the work will be
>> re-scheduled on another. I think the code requires the queue to be
>> marked UNBOUND to really be scheduled on a different CPU. Just my
>> reading of the code, and it matches what is seen with the krb5
>> workload.
>
> Trond, what's the path forward here? What about a run-time
> configuration that starts rpciod with the UNBOUND option instead?
>
> <dd_read_single.html><dd_read_mult.html><dd_write_mult.html><dd_read_single1.html>

--
Chuck Lever
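Regarding the run-time configuration Olga floats at the end of the quoted
thread: one possible shape for such a knob is a module parameter consulted
when the rpciod workqueue is created. The sketch below is hypothetical --
sunrpc has no such parameter today, and "rpciod_unbound" is an invented
name; it is loosely modelled on the rpciod_start() shown in the patch above:

/*
 * Hypothetical sketch only: sunrpc currently hard-codes the workqueue
 * flags; this simply makes WQ_UNBOUND an opt-in knob.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/workqueue.h>

static bool rpciod_unbound;	/* e.g. sunrpc.rpciod_unbound=1, illustrative only */
module_param(rpciod_unbound, bool, 0444);
MODULE_PARM_DESC(rpciod_unbound, "Create rpciod as an unbound workqueue");

static struct workqueue_struct *rpciod_workqueue;

/* Roughly mirrors rpciod_start() from the quoted patch, simplified. */
static int rpciod_start(void)
{
	struct workqueue_struct *wq;
	unsigned int flags = WQ_MEM_RECLAIM;

	if (rpciod_unbound)
		flags |= WQ_UNBOUND;	/* opt in to the behaviour benchmarked above */

	wq = alloc_workqueue("rpciod", flags, 0);
	if (!wq)
		return 0;
	rpciod_workqueue = wq;
	return 1;
}

If something like this lived inside sunrpc, booting with
sunrpc.rpciod_unbound=1 would give the unbound behaviour only to users who
opt in, leaving the default per-cpu behaviour unchanged for everyone else.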