On Mon, Mar 4, 2024 at 2:33 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
> > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> >
> > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@xxxxxxxxxx wrote:
> >>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> >>>
> >>> It appears that in certain cases, RDMA capable transports can benefit
> >>> from the ability to establish multiple connections to increase their
> >>> throughput. This patch therefore enables the use of the "nconnect" mount
> >>> option for those use cases.
> >>>
> >>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> >>
> >> No objection to this patch.
> >>
> >> You don't mention here if you have root-caused the throughput issue.
> >> One thing I've noticed is that contention for the transport's
> >> queue_lock is holding back the RPC/RDMA Receive completion handler,
> >> which is single-threaded per transport.
> >
> > Curious: how is a queue_lock per transport a problem for
> > nconnect? nconnect would create its own transport, would it not, and so
> > it would have its own queue_lock (per nconnect).
>
> I did not mean to imply that queue_lock contention is a
> problem for nconnect or would increase when there are
> multiple transports.

Got it. I wanted to make sure I didn't misunderstand something. You are
stating that even for a single transport there could be improvements
made to make sending and receiving more independent of each other.

> But there is definitely lock contention between the send and
> receive code paths, and that could be one source of the relief
> that Trond saw by adding more transports. IMO that contention
> should be addressed at some point.
>
> I'm not asking for a change to the proposed patch. But I am
> suggesting some possible future work.
>
>
> >> A way to mitigate this would be to replace the recv_queue
> >> R-B tree with a data structure that can perform a lookup while
> >> holding only the RCU read lock. I have experimented with using an
> >> xarray for the recv_queue, and saw improvement.
> >>
> >> The workload on that data structure is different for TCP versus
> >> RDMA, though: on RDMA, the number of RPCs in flight is significantly
> >> smaller. For tens of thousands of RPCs in flight, an xarray might
> >> not scale well. The newer Maple tree or rose bush hash, or maybe a
> >> more classic data structure like rhashtable, might handle this
> >> workload better.
> >>
> >> It's also worth considering deleting each RPC from the recv_queue
> >> in a less performance-sensitive context, for example, xprt_release,
> >> rather than in xprt_complete_rqst.
>
> --
> Chuck Lever
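
To make the RCU-friendly lookup idea above concrete, here is a minimal
sketch. It is not the existing sunrpc code: the recv_xa field and the
xprt_xa_* helpers are hypothetical, standing in for an xarray keyed by
XID that would replace the recv_queue R-B tree, with the erase moved
out of xprt_complete_rqst() into a colder path such as xprt_release().

/*
 * Sketch only: assumes struct rpc_xprt gained a hypothetical
 * "struct xarray recv_xa" in place of the recv_queue rb-tree.
 */
#include <linux/xarray.h>
#include <linux/sunrpc/xprt.h>

/* Send side: publish the request under its XID. The xarray takes its
 * own internal lock here, so the shared queue_lock is not needed. */
static int xprt_xa_track_rqst(struct rpc_xprt *xprt, struct rpc_rqst *req)
{
	return xa_err(xa_store(&xprt->recv_xa,
			       (unsigned long)be32_to_cpu(req->rq_xid),
			       req, GFP_KERNEL));
}

/* Receive completion: resolve XID -> rqst while holding only the RCU
 * read lock. Caller must hold rcu_read_lock(); a real conversion would
 * also need to pin the request, and free rqsts only after an RCU grace
 * period, before the result is used outside the read section. */
static struct rpc_rqst *xprt_xa_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
{
	return xa_load(&xprt->recv_xa, (unsigned long)be32_to_cpu(xid));
}

/* Teardown: erase the entry in a less performance-sensitive context,
 * e.g. from xprt_release(), rather than in xprt_complete_rqst(). */
static void xprt_xa_untrack_rqst(struct rpc_xprt *xprt, struct rpc_rqst *req)
{
	xa_erase(&xprt->recv_xa, (unsigned long)be32_to_cpu(req->rq_xid));
}

In such a conversion the Receive completion handler would perform the
lookup inside rcu_read_lock()/rcu_read_unlock() and pin the request
(compare xprt_pin_rqst() in the current code) before copying the reply,
so send and receive paths no longer serialize on queue_lock for every
lookup.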