On Fri, 2018-09-07 at 11:44 -0400, Chuck Lever wrote:
> 
> Client: 12-core, two-socket, 56Gb InfiniBand
> Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export
> 
> Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers
> --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70
> --iodepth=16 --numjobs=16 --runtime=240 --group_reporting
> 
> NFSv3 on RDMA:
> Stock v4.19-rc2:
> • read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
> • write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
> • write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
> • write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)
> 
> NFSv3 on TCP (IPoIB):
> Stock v4.19-rc2:
> • read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
> • write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
> • write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
> • write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)
> 
> 
> Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers
> --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100
> --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting
> 
> NFSv3 on RDMA:
> Stock v4.19-rc2:
> • read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)
> 
> NFSv3 on TCP (IPoIB):
> Stock v4.19-rc2:
> • read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)
> 
> 
> Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N
> 
> My kernel: 4.19.0-rc2-00026-g50d68a4
> system call latencies in microseconds, N=5:
> • write: mean=602, std=13.0
> • rewrite: mean=541, std=17.3
> server round trip latency in microseconds, N=5:
> • RTT: mean=354, std=3.0
> 
> Trond's kernel (with fair queuing):
> system call latencies in microseconds, N=5:
> • write: mean=572, std=10.6
> • rewrite: mean=533, std=7.9
> server round trip latency in microseconds, N=5:
> • RTT: mean=352, std=2.7

Thanks for testing! I've spent the last 3 days trying to figure out why we're seeing regressions with RDMA. I think I have a few candidates:

- The congestion control was failing to wake up the write lock when we queue a request that has already been allocated a congestion control credit.
- The livelock avoidance code in xprt_transmit() was causing the queueing to break.
- An incorrect return value from xprt_transmit() when the queue is empty caused the request to retry waiting for the lock.
- A race in xprt_prepare_transmit() could cause a request to wait for the write lock despite having already been transmitted by another request.
- The change to convert the write lock into a non-priority queue also changed the wake-up code, causing the request that is granted the lock to be queued on rpciod instead of on the low-latency xprtiod workqueue.

I've fixed all of the above. In addition, I've tightened up a few cases where we were grabbing spinlocks unnecessarily, and I've converted the reply lookup to use an rbtree (sketched below) in order to reduce the amount of time we need to hold the xprt->queue_lock.

The new code has been rebased onto 4.19.0-rc3 and is now available on the 'testing' branch. Would you be able to give it another quick spin?

Thanks!
Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
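
For anyone curious, here is roughly what an XID-keyed rbtree reply lookup looks like using the kernel's generic rbtree API. This is only an illustrative sketch with simplified, made-up names (pending_rqst, xprt_recv_queue, recv_queue_*), not the actual net/sunrpc code; the point is that the lookup done while holding the queue lock becomes an O(log n) tree walk instead of a linear list scan.

    /*
     * Illustrative sketch only: match an incoming reply to its pending
     * request by XID with an rbtree, so the work done under the queue
     * lock stays short.  Names and fields are simplified assumptions.
     */
    #include <linux/rbtree.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct pending_rqst {
    	struct rb_node	node;
    	u32		xid;	/* XID treated as an opaque key; any
    				 * consistent ordering works */
    	/* ... reply buffer, completion, etc. ... */
    };

    struct xprt_recv_queue {
    	spinlock_t	lock;		/* stands in for xprt->queue_lock */
    	struct rb_root	xid_tree;	/* initialise with RB_ROOT */
    };

    /* Insert a request awaiting a reply; caller holds q->lock. */
    static void recv_queue_insert(struct xprt_recv_queue *q,
    			      struct pending_rqst *req)
    {
    	struct rb_node **p = &q->xid_tree.rb_node, *parent = NULL;

    	while (*p) {
    		struct pending_rqst *cur;

    		parent = *p;
    		cur = rb_entry(parent, struct pending_rqst, node);
    		if (req->xid < cur->xid)
    			p = &parent->rb_left;
    		else
    			p = &parent->rb_right;
    	}
    	rb_link_node(&req->node, parent, p);
    	rb_insert_color(&req->node, &q->xid_tree);
    }

    /* Find the request matching a reply's XID; caller holds q->lock. */
    static struct pending_rqst *recv_queue_lookup(struct xprt_recv_queue *q,
    					      u32 xid)
    {
    	struct rb_node *n = q->xid_tree.rb_node;

    	while (n) {
    		struct pending_rqst *cur =
    			rb_entry(n, struct pending_rqst, node);

    		if (xid < cur->xid)
    			n = n->rb_left;
    		else if (xid > cur->xid)
    			n = n->rb_right;
    		else
    			return cur;
    	}
    	return NULL;
    }

Insertion at transmit time and removal (rb_erase()) when the reply is consumed would happen under the same lock, so the reply-path lookup stays cheap even with a large number of outstanding requests.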