> On Sep 9, 2018, at 9:35 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2018-09-07 at 11:44 -0400, Chuck Lever wrote:
>>
>> Client: 12-core, two-socket, 56Gb InfiniBand
>> Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export
>>
>> Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers
>> --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70
>> --iodepth=16 --numjobs=16 --runtime=240 --group_reporting
>>
>> NFSv3 on RDMA:
>> Stock v4.19-rc2:
>> • read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
>> • write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
>> • write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
>> • write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)
>>
>> NFSv3 on TCP (IPoIB):
>> Stock v4.19-rc2:
>> • read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
>> • write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
>> • write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
>> • write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)
>>
>> Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers
>> --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100
>> --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting
>>
>> NFSv3 on RDMA:
>> Stock v4.19-rc2:
>> • read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)
>>
>> NFSv3 on TCP (IPoIB):
>> Stock v4.19-rc2:
>> • read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)
>>
>> Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N
>>
>> My kernel: 4.19.0-rc2-00026-g50d68a4
>> system call latencies in microseconds, N=5:
>> • write: mean=602, std=13.0
>> • rewrite: mean=541, std=17.3
>> server round trip latency in microseconds, N=5:
>> • RTT: mean=354, std=3.0
>>
>> Trond's kernel (with fair queuing):
>> system call latencies in microseconds, N=5:
>> • write: mean=572, std=10.6
>> • rewrite: mean=533, std=7.9
>> server round trip latency in microseconds, N=5:
>> • RTT: mean=352, std=2.7
>
> Thanks for testing! I've been spending the last 3 days trying to figure
> out why we're seeing regressions with RDMA. I think I have a few
> candidates:
>
> - The congestion control code was failing to wake up the write lock
>   when we queued a request that had already been allocated a
>   congestion control credit.
> - The livelock avoidance code in xprt_transmit() was causing the
>   queueing to break.
> - An incorrect return value from xprt_transmit() when the queue is
>   empty caused the request to retry waiting for the lock.
> - A race in xprt_prepare_transmit() could cause a request to wait for
>   the write lock despite having been transmitted by another request.
> - The change to convert the write lock into a non-priority queue also
>   changed the wake-up code, causing the request that is granted the
>   lock to be queued on rpciod instead of on the low-latency xprtiod
>   workqueue.
>
> I've fixed all the above. In addition, I've tightened up a few cases
> where we were grabbing spinlocks unnecessarily, and I've converted the
> reply lookup to use an rbtree in order to reduce the amount of time we
> need to hold the xprt->queue_lock.
>
> The new code has been rebased onto 4.19.0-rc3 and is now available on
> the 'testing' branch. Would you be able to give it another quick spin?

We're in much better shape now. Compare the stock v4.19-rc2 numbers
above with these from your latest testing branch. The new results show
a consistent throughput improvement: roughly 8-10% on RDMA and 15-18%
on TCP (IPoIB).

Test 1 from above:

NFSv3 on RDMA:
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=118k, BW=921MiB/s (966MB/s)(11.2GiB/12469msec)
• write: IOPS=50.3k, BW=393MiB/s (412MB/s)(4899MiB/12469msec)

NFSv3 on TCP (IPoIB):
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=27.4k, BW=214MiB/s (224MB/s)(11.2GiB/53650msec)
• write: IOPS=11.7k, BW=91.6MiB/s (96.0MB/s)(4913MiB/53650msec)

Test 2 from above:

NFSv3 on RDMA:
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=163k, BW=636MiB/s (667MB/s)(16.0GiB/25743msec)

NFSv3 on TCP (IPoIB):
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=44.2k, BW=173MiB/s (181MB/s)(16.0GiB/94898msec)

For anyone following along, minimal sketches of the rbtree lookup and
the dedicated-workqueue wake-up that Trond describes above appear at
the end of this message.

--
Chuck Lever
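
On the rbtree change: here is a rough sketch of what an XID-keyed
red-black tree lookup looks like with the <linux/rbtree.h> API. The
names (struct pending_req, req_lookup, req_insert) are made up for
illustration and are not from the actual patch; real code would also
have to deal with duplicate XIDs and byte order. The point is that
search and insert are O(log n), so xprt->queue_lock is held for far
less time than a linear walk of the pending-request list.

    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct pending_req {
            struct rb_node  rb;
            u32             xid;    /* request XID, host byte order */
            /* ... remainder of the request state ... */
    };

    /* O(log n) reply lookup, performed under xprt->queue_lock */
    static struct pending_req *req_lookup(struct rb_root *root, u32 xid)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct pending_req *req =
                            rb_entry(n, struct pending_req, rb);

                    if (xid < req->xid)
                            n = n->rb_left;
                    else if (xid > req->xid)
                            n = n->rb_right;
                    else
                            return req;
            }
            return NULL;
    }

    static void req_insert(struct rb_root *root, struct pending_req *new)
    {
            struct rb_node **p = &root->rb_node, *parent = NULL;

            while (*p) {
                    struct pending_req *req =
                            rb_entry(*p, struct pending_req, rb);

                    parent = *p;
                    if (new->xid < req->xid)
                            p = &(*p)->rb_left;
                    else
                            p = &(*p)->rb_right;
            }
            /* link the new node, then rebalance to restore rb invariants */
            rb_link_node(&new->rb, parent, p);
            rb_insert_color(&new->rb, root);
    }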
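
On the rpciod-versus-xprtiod point: the difference is simply which
workqueue runs the wake-up. A sketch of creating a dedicated workqueue
and queuing work on it is below; the flag choices here are illustrative
only, not necessarily the exact flags the kernel uses for xprtiod.

    #include <linux/module.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *example_wq;
    static struct work_struct example_work;

    static void example_fn(struct work_struct *work)
    {
            /* runs in process context on the dedicated workqueue */
    }

    static int __init example_init(void)
    {
            /* WQ_UNBOUND: not pinned to the submitting CPU, so the
             * work can start as soon as any worker is free.
             * WQ_HIGHPRI: served by a high-priority worker pool.
             * Both help latency-sensitive work avoid queueing behind
             * unrelated items on a shared workqueue. */
            example_wq = alloc_workqueue("example",
                                         WQ_UNBOUND | WQ_HIGHPRI, 0);
            if (!example_wq)
                    return -ENOMEM;
            INIT_WORK(&example_work, example_fn);
            queue_work(example_wq, &example_work);
            return 0;
    }

    static void __exit example_exit(void)
    {
            destroy_workqueue(example_wq);
    }

    module_init(example_init);
    module_exit(example_exit);
    MODULE_LICENSE("GPL");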