On Fri, 2018-09-07 at 11:44 -0400, Chuck Lever wrote:
> 
> Client: 12-core, two-socket, 56Gb InfiniBand
> Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export
> 
> Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers
> --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70
> --iodepth=16 --numjobs=16 --runtime=240 --group_reporting
> 
> NFSv3 on RDMA:
> Stock v4.19-rc2:
> • read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
> • write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
> • write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
> • write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)
> 
> NFSv3 on TCP (IPoIB):
> Stock v4.19-rc2:
> • read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
> • write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
> • write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
> • write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)
> 
> 
> Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers
> --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100
> --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting
> 
> NFSv3 on RDMA:
> Stock v4.19-rc2:
> • read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)
> 
> NFSv3 on TCP (IPoIB):
> Stock v4.19-rc2:
> • read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
> Trond's kernel (with fair queuing):
> • read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
> Trond's kernel (without fair queuing):
> • read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)
> 
> 
> Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N
> 
> My kernel: 4.19.0-rc2-00026-g50d68a4
> system call latencies in microseconds, N=5:
> • write: mean=602, std=13.0
> • rewrite: mean=541, std=17.3
> server round trip latency in microseconds, N=5:
> • RTT: mean=354, std=3.0
> 
> Trond's kernel (with fair queuing):
> system call latencies in microseconds, N=5:
> • write: mean=572, std=10.6
> • rewrite: mean=533, std=7.9
> server round trip latency in microseconds, N=5:
> • RTT: mean=352, std=2.7

Thanks for testing! I've spent the last 3 days trying to figure out why we're seeing regressions with RDMA. I think I have a few candidates:

- The congestion control was failing to wake up the write lock when we queue a request that has already been allocated a congestion control credit.
- The livelock avoidance code in xprt_transmit() was causing the queueing to break.
- An incorrect return value from xprt_transmit() when the queue is empty caused the request to retry waiting for the lock.
- A race in xprt_prepare_transmit() could cause a request to wait for the write lock despite having already been transmitted by another request.
- The change to convert the write lock into a non-priority queue also changed the wake-up code, causing the request that is granted the lock to be queued on rpciod instead of on the low-latency xprtiod workqueue.

I've fixed all of the above. In addition, I've tightened up a few cases where we were grabbing spinlocks unnecessarily, and I've converted the reply lookup to use an rbtree (sketched below) in order to reduce the amount of time we need to hold the xprt->queue_lock.

The new code has been rebased onto 4.19.0-rc3 and is now available on the 'testing' branch. Would you be able to give it another quick spin?

Thanks!
Trond

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
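
For anyone curious, here is roughly what an XID-keyed rbtree reply lookup looks like using the kernel's generic rbtree API. This is only an illustrative sketch with simplified, made-up names (pending_rqst, xprt_recv_queue, recv_queue_*), not the actual net/sunrpc code; the point is that the lookup done while holding the queue lock becomes an O(log n) tree walk instead of a linear list scan.

    /*
     * Illustrative sketch only: match an incoming reply to its pending
     * request by XID with an rbtree, so the work done under the queue
     * lock stays short.  Names and fields are simplified assumptions.
     */
    #include <linux/rbtree.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct pending_rqst {
    	struct rb_node	node;
    	u32		xid;	/* XID treated as an opaque key; any
    				 * consistent ordering works */
    	/* ... reply buffer, completion, etc. ... */
    };

    struct xprt_recv_queue {
    	spinlock_t	lock;		/* stands in for xprt->queue_lock */
    	struct rb_root	xid_tree;	/* initialise with RB_ROOT */
    };

    /* Insert a request awaiting a reply; caller holds q->lock. */
    static void recv_queue_insert(struct xprt_recv_queue *q,
    			      struct pending_rqst *req)
    {
    	struct rb_node **p = &q->xid_tree.rb_node, *parent = NULL;

    	while (*p) {
    		struct pending_rqst *cur;

    		parent = *p;
    		cur = rb_entry(parent, struct pending_rqst, node);
    		if (req->xid < cur->xid)
    			p = &parent->rb_left;
    		else
    			p = &parent->rb_right;
    	}
    	rb_link_node(&req->node, parent, p);
    	rb_insert_color(&req->node, &q->xid_tree);
    }

    /* Find the request matching a reply's XID; caller holds q->lock. */
    static struct pending_rqst *recv_queue_lookup(struct xprt_recv_queue *q,
    					      u32 xid)
    {
    	struct rb_node *n = q->xid_tree.rb_node;

    	while (n) {
    		struct pending_rqst *cur =
    			rb_entry(n, struct pending_rqst, node);

    		if (xid < cur->xid)
    			n = n->rb_left;
    		else if (xid > cur->xid)
    			n = n->rb_right;
    		else
    			return cur;
    	}
    	return NULL;
    }

Insertion at transmit time and removal (rb_erase()) when the reply is consumed would happen under the same lock, so the reply-path lookup stays cheap even with a large number of outstanding requests.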