> On May 28, 2019, at 3:52 PM, Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
>> On May 28, 2019, at 3:33 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>>
>> On Tue, 2019-05-28 at 15:03 -0400, Chuck Lever wrote:
>>> Following up on this. Now with even more data!
>>>
>>>> On May 6, 2019, at 4:41 PM, Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>> wrote:
>>>>
>>>>
>>>>> On May 3, 2019, at 7:18 AM, Trond Myklebust <trondmy@xxxxxxxxx>
>>>>> wrote:
>>>>>
>>>>> Allow more time for softirqd
>>>>
>>>> Have you thought about performance tests for this one?
>>>
>>> I tested this series on my 12-core two-socket client using a variety
>>> of tests including iozone, fio, and fstests. The network under test
>>> is 56Gb InfiniBand (TCP uses IPoIB). I tested both TCP and RDMA.
>>>
>>> With lock debugging and memory leak testing enabled, I did not see
>>> any functional regressions or new leaks or crashes. Thus IMO this
>>> series is "safe to apply."
>>>
>>> With TCP, I saw no change in performance between a "stock" kernel
>>> and one with all five patches in this series applied, as, IIRC,
>>> you predicted.
>>>
>>> The following discussion is based on testing with NFS/RDMA.
>>>
>>> With RDMA, I saw an improvement of 5-10% in IOPS rate between the
>>> "stock" kernel and a kernel with the first four patches applied. When
>>> the fifth patch is applied, I saw IOPS throughput significantly worse
>>> than "stock" -- like 20% worse.
>>>
>>> I also studied average RPC execution time (the "execute" metric) with
>>> the "stock" kernel, the one with four patches applied, and with the
>>> one where all five are applied. The workload is 100% 4KB READs with
>>> an iodepth of 1024 in order to saturate the transmit queue.
>>>
>>> With four patches, the execute time is about 2.5 msec faster (average
>>> execution time is around 75 msec due to the large backlog this test
>>> generates). With five patches, it's slower than "stock" by 12 msec.
>>>
>>> I also saw a 30 usec improvement in the average latency of
>>> xprt_complete_rqst with the four patch series.
>>>
>>> As far as I can tell, the benefit of this series comes mostly from
>>> the third patch, which changes spin_lock_bh(&xprt->transport_lock) to
>>> spin_lock(&xprt->transport_lock). When the xprtiod work queue is
>>> lowered in priority in 5/5, that benefit vanishes.
>>>
>>> I am still confused about why 5/5 is needed. I did not see any soft
>>> lockups without this patch applied when using RDMA. Is the issue
>>> with xprtsock's use of xprtiod for handling incoming TCP receives?
>>>
>>> I still have some things I'd like to look at. One thing I haven't
>>> yet tried is looking at lock_stat, which would confirm or refute
>>> my theory that this is all about the transport_lock, for instance.
>>>
>>
>> OK. I can drop 5/5.
>>
>> The issue there was not about soft lockups. However since we were
>> previously running most soft irqs as part of spin_unlock_bh(), the
>> question was whether or not we would see more of them needing to move
>> to softirqd. As far as I can see, your answer to that question is 'no'
>> (at least for your system).
>
> The top contended lock now is the work queue lock. I believe that's a
> full irqsave lock. Someone should try testing on a single core system.
>
> I also plan to try this series on my mlx5_en system. The mlx5 Ethernet
> driver does a lot more work in soft IRQ than mlx4/IB does.

I tested with CX-5 RoCE on 100GbE. I don't see any obvious signs of
soft IRQ starvation.

With 8 threads on a 4-core client, I was able to push the 4KB random
read fio workload past 300KIOPS.
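
For anyone who wants to approximate that workload, a fio job along the
lines below should be in the right ballpark. This is only a sketch, not
the actual job file behind the numbers above; the mount point, file
size, iodepth, and runtime are placeholders to adjust for your own
setup.

[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
numjobs=8
iodepth=16
time_based
runtime=60
group_reporting
; placeholder: a directory on the NFS mount under test
directory=/mnt/nfs

[randread-4k]
size=1g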

--
Chuck Lever