On Mon, Mar 04, 2024 at 11:08:00PM +0000, Trond Myklebust wrote:
> On Mon, 2024-03-04 at 19:32 +0000, Chuck Lever III wrote:
> >
> >
> > > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> > >
> > > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@xxxxxxxxxx wrote:
> > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > >
> > > > > It appears that in certain cases, RDMA capable transports can
> > > > > benefit from the ability to establish multiple connections to
> > > > > increase their throughput. This patch therefore enables the use
> > > > > of the "nconnect" mount option for those use cases.
> > > > >
> > > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > >
> > > > No objection to this patch.
> > > >
> > > > You don't mention here if you have root-caused the throughput issue.
> > > > One thing I've noticed is that contention for the transport's
> > > > queue_lock is holding back the RPC/RDMA Receive completion handler,
> > > > which is single-threaded per transport.
> > >
> > > Curious: how is a queue_lock per transport a problem for nconnect?
> > > nconnect would create its own transport, wouldn't it, and so it
> > > would have its own queue_lock (per nconnect).
> >
> > I did not mean to imply that queue_lock contention is a problem for
> > nconnect or would increase when there are multiple transports.
> >
> > But there is definitely lock contention between the send and receive
> > code paths, and that could be one source of the relief that Trond saw
> > by adding more transports. IMO that contention should be addressed at
> > some point.
> >
> > I'm not asking for a change to the proposed patch. But I am suggesting
> > some possible future work.
>
> We were comparing NFS/RDMA performance to that of NFS/TCP, and it was
> clear that the nconnect value was giving the latter a major boost. Once
> we enabled nconnect for the RDMA channel, the values evened out a lot
> more.
>
> Once we fixed the nconnect issue, what we were seeing when the RDMA
> code maxed out was actually that the CPU got pegged running the IB
> completion work queues on writes.
>
> We can certainly look into improving the performance of
> xprt_lookup_rqst() if we have evidence that it is slow, but I'm not yet
> sure that was what we were seeing.

One observation: the Receive completion handler doesn't do anything
that is CPU-intensive. If ib_comp_wq is hot, that's an indication of
lock contention.

I've found there are typically two contended locks when handling
RPC/RDMA Receive completions:

- The workqueue pool lock. Tejun mitigated that issue in v6.7.
- The queue_lock, as described above.

A flame graph might narrow the issue.

--
Chuck Lever
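
[As an illustration of the flame-graph approach suggested above, a
capture with perf and Brendan Gregg's FlameGraph scripts might look
roughly like this; the 30-second window, script paths, and output
file name are placeholders, not taken from the thread:

    # sample all CPUs with call graphs while the write workload runs
    perf record -a -g -- sleep 30

    # fold the sampled stacks and render an SVG flame graph
    perf script | ./FlameGraph/stackcollapse-perf.pl \
        | ./FlameGraph/flamegraph.pl > rpcrdma-receive.svg

If the contention Chuck describes is present, one would typically
expect wide queued_spin_lock_slowpath frames in the kworker stacks
that service ib_comp_wq, rather than time in the completion handler
itself.]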