> On Sep 9, 2018, at 9:35 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2018-09-07 at 11:44 -0400, Chuck Lever wrote:
>>
>> Client: 12-core, two-socket, 56Gb InfiniBand
>> Server: 4-core, one-socket, 56Gb InfiniBand, tmpfs export
>>
>> Test: /usr/bin/fio --size=1G --direct=1 --rw=randrw --refill_buffers
>> --norandommap --randrepeat=0 --ioengine=libaio --bs=8k --rwmixread=70
>> --iodepth=16 --numjobs=16 --runtime=240 --group_reporting
>>
>> NFSv3 on RDMA:
>> Stock v4.19-rc2:
>> • read: IOPS=109k, BW=849MiB/s (890MB/s)(11.2GiB/13506msec)
>> • write: IOPS=46.6k, BW=364MiB/s (382MB/s)(4915MiB/13506msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=83.0k, BW=649MiB/s (680MB/s)(11.2GiB/17676msec)
>> • write: IOPS=35.6k, BW=278MiB/s (292MB/s)(4921MiB/17676msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=90.5k, BW=707MiB/s (742MB/s)(11.2GiB/16216msec)
>> • write: IOPS=38.8k, BW=303MiB/s (318MB/s)(4917MiB/16216msec)
>>
>> NFSv3 on TCP (IPoIB):
>> Stock v4.19-rc2:
>> • read: IOPS=23.8k, BW=186MiB/s (195MB/s)(11.2GiB/61635msec)
>> • write: IOPS=10.2k, BW=79.9MiB/s (83.8MB/s)(4923MiB/61635msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=25.9k, BW=202MiB/s (212MB/s)(11.2GiB/56710msec)
>> • write: IOPS=11.1k, BW=86.7MiB/s (90.9MB/s)(4916MiB/56710msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=25.0k, BW=203MiB/s (213MB/s)(11.2GiB/56492msec)
>> • write: IOPS=11.1k, BW=86.0MiB/s (91.2MB/s)(4915MiB/56492msec)
>>
>> Test: /usr/bin/fio --size=1G --direct=1 --rw=randread --refill_buffers
>> --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100
>> --iodepth=1024 --numjobs=16 --runtime=240 --group_reporting
>>
>> NFSv3 on RDMA:
>> Stock v4.19-rc2:
>> • read: IOPS=149k, BW=580MiB/s (608MB/s)(16.0GiB/28241msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=81.5k, BW=318MiB/s (334MB/s)(16.0GiB/51450msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=82.4k, BW=322MiB/s (337MB/s)(16.0GiB/50918msec)
>>
>> NFSv3 on TCP (IPoIB):
>> Stock v4.19-rc2:
>> • read: IOPS=37.2k, BW=145MiB/s (153MB/s)(16.0GiB/112630msec)
>> Trond's kernel (with fair queuing):
>> • read: IOPS=2715, BW=10.6MiB/s (11.1MB/s)(2573MiB/242594msec)
>> Trond's kernel (without fair queuing):
>> • read: IOPS=2869, BW=11.2MiB/s (11.8MB/s)(2724MiB/242979msec)
>>
>> Test: /home/cel/bin/iozone -M -i0 -s8g -r512k -az -I -N
>>
>> My kernel: 4.19.0-rc2-00026-g50d68a4
>> system call latencies in microseconds, N=5:
>> • write: mean=602, std=13.0
>> • rewrite: mean=541, std=17.3
>> server round trip latency in microseconds, N=5:
>> • RTT: mean=354, std=3.0
>>
>> Trond's kernel (with fair queuing):
>> system call latencies in microseconds, N=5:
>> • write: mean=572, std=10.6
>> • rewrite: mean=533, std=7.9
>> server round trip latency in microseconds, N=5:
>> • RTT: mean=352, std=2.7
>
> Thanks for testing! I've been spending the last 3 days trying to figure
> out why we're seeing regressions with RDMA. I think I have a few
> candidates:
>
> - The congestion control code was failing to wake up the write lock
>   when we queued a request that had already been allocated a
>   congestion control credit.
> - The livelock avoidance code in xprt_transmit() was causing the
>   queueing to break.
> - An incorrect return value from xprt_transmit() when the queue is
>   empty caused the request to retry waiting for the lock.
> - A race in xprt_prepare_transmit() could cause a request to wait for
>   the write lock despite having been transmitted by another request.
> - The change to convert the write lock into a non-priority queue also
>   changed the wake-up code, causing the request that is granted the
>   lock to be queued on rpciod instead of on the low-latency xprtiod
>   workqueue.
>
> I've fixed all the above. In addition, I've tightened up a few cases
> where we were grabbing spinlocks unnecessarily, and I've converted the
> reply lookup to use an rbtree in order to reduce the amount of time we
> need to hold the xprt->queue_lock.
>
> The new code has been rebased onto 4.19.0-rc3 and is now available on
> the 'testing' branch. Would you be able to give it another quick spin?

We're in much better shape now. Compare the stock v4.19-rc2 numbers
above with these from your latest testing branch. The new results show
a consistent throughput improvement: roughly 8-10% on RDMA and 15-18%
on TCP (IPoIB).

Test 1 from above:

NFSv3 on RDMA:
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=118k, BW=921MiB/s (966MB/s)(11.2GiB/12469msec)
• write: IOPS=50.3k, BW=393MiB/s (412MB/s)(4899MiB/12469msec)

NFSv3 on TCP (IPoIB):
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=27.4k, BW=214MiB/s (224MB/s)(11.2GiB/53650msec)
• write: IOPS=11.7k, BW=91.6MiB/s (96.0MB/s)(4913MiB/53650msec)

Test 2 from above:

NFSv3 on RDMA:
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=163k, BW=636MiB/s (667MB/s)(16.0GiB/25743msec)

NFSv3 on TCP (IPoIB):
4.19.0-rc3-13903-g11dddfd:
• read: IOPS=44.2k, BW=173MiB/s (181MB/s)(16.0GiB/94898msec)

For anyone following along, minimal sketches of the rbtree lookup and
the dedicated-workqueue wake-up that Trond describes above appear at
the end of this message.

--
Chuck Lever
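
On the rbtree change: here is a rough sketch of what an XID-keyed
red-black tree lookup looks like with the <linux/rbtree.h> API. The
names (struct pending_req, req_lookup, req_insert) are made up for
illustration and are not from the actual patch; real code would also
have to deal with duplicate XIDs and byte order. The point is that
search and insert are O(log n), so xprt->queue_lock is held for far
less time than a linear walk of the pending-request list.

    #include <linux/rbtree.h>
    #include <linux/types.h>

    struct pending_req {
            struct rb_node  rb;
            u32             xid;    /* request XID, host byte order */
            /* ... remainder of the request state ... */
    };

    /* O(log n) reply lookup, performed under xprt->queue_lock */
    static struct pending_req *req_lookup(struct rb_root *root, u32 xid)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct pending_req *req =
                            rb_entry(n, struct pending_req, rb);

                    if (xid < req->xid)
                            n = n->rb_left;
                    else if (xid > req->xid)
                            n = n->rb_right;
                    else
                            return req;
            }
            return NULL;
    }

    static void req_insert(struct rb_root *root, struct pending_req *new)
    {
            struct rb_node **p = &root->rb_node, *parent = NULL;

            while (*p) {
                    struct pending_req *req =
                            rb_entry(*p, struct pending_req, rb);

                    parent = *p;
                    if (new->xid < req->xid)
                            p = &(*p)->rb_left;
                    else
                            p = &(*p)->rb_right;
            }
            /* link the new node, then rebalance to restore rb invariants */
            rb_link_node(&new->rb, parent, p);
            rb_insert_color(&new->rb, root);
    }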
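
On the rpciod-versus-xprtiod point: the difference is simply which
workqueue runs the wake-up. A sketch of creating a dedicated workqueue
and queuing work on it is below; the flag choices here are illustrative
only, not necessarily the exact flags the kernel uses for xprtiod.

    #include <linux/module.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *example_wq;
    static struct work_struct example_work;

    static void example_fn(struct work_struct *work)
    {
            /* runs in process context on the dedicated workqueue */
    }

    static int __init example_init(void)
    {
            /* WQ_UNBOUND: not pinned to the submitting CPU, so the
             * work can start as soon as any worker is free.
             * WQ_HIGHPRI: served by a high-priority worker pool.
             * Both help latency-sensitive work avoid queueing behind
             * unrelated items on a shared workqueue. */
            example_wq = alloc_workqueue("example",
                                         WQ_UNBOUND | WQ_HIGHPRI, 0);
            if (!example_wq)
                    return -ENOMEM;
            INIT_WORK(&example_work, example_fn);
            queue_work(example_wq, &example_work);
            return 0;
    }

    static void __exit example_exit(void)
    {
            destroy_workqueue(example_wq);
    }

    module_init(example_init);
    module_exit(example_exit);
    MODULE_LICENSE("GPL");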