On Mon, Mar 4, 2024 at 2:33 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
>
> > On Mar 4, 2024, at 2:01 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> >
> > On Sun, Mar 3, 2024 at 1:35 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Feb 28, 2024 at 04:35:23PM -0500, trondmy@xxxxxxxxxx wrote:
> >>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> >>>
> >>> It appears that in certain cases, RDMA capable transports can benefit
> >>> from the ability to establish multiple connections to increase their
> >>> throughput. This patch therefore enables the use of the "nconnect" mount
> >>> option for those use cases.
> >>>
> >>> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> >>
> >> No objection to this patch.
> >>
> >> You don't mention here if you have root-caused the throughput issue.
> >> One thing I've noticed is that contention for the transport's
> >> queue_lock is holding back the RPC/RDMA Receive completion handler,
> >> which is single-threaded per transport.
> >
> > Curious: how is a queue_lock per transport a problem for
> > nconnect? nconnect would create its own transport, would it not, and so
> > it would have its own queue_lock (per nconnect).
>
> I did not mean to imply that queue_lock contention is a
> problem for nconnect or would increase when there are
> multiple transports.

Got it. I wanted to make sure I didn't misunderstand something. You are
stating that even for a single transport there could be improvements
made to make sending and receiving more independent of each other.

> But there is definitely lock contention between the send and
> receive code paths, and that could be one source of the relief
> that Trond saw by adding more transports. IMO that contention
> should be addressed at some point.
>
> I'm not asking for a change to the proposed patch. But I am
> suggesting some possible future work.
>
>
> >> A way to mitigate this would be to replace the recv_queue
> >> R-B tree with a data structure that can perform a lookup while
> >> holding only the RCU read lock. I have experimented with using an
> >> xarray for the recv_queue, and saw improvement.
> >>
> >> The workload on that data structure is different for TCP versus
> >> RDMA, though: on RDMA, the number of RPCs in flight is significantly
> >> smaller. For tens of thousands of RPCs in flight, an xarray might
> >> not scale well. The newer Maple tree or rose bush hash, or maybe a
> >> more classic data structure like rhashtable, might handle this
> >> workload better.
> >>
> >> It's also worth considering deleting each RPC from the recv_queue
> >> in a less performance-sensitive context, for example, xprt_release,
> >> rather than in xprt_complete_rqst.
>
> --
> Chuck Lever
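
To make the RCU-friendly lookup idea above concrete, here is a minimal
sketch. It is not the existing sunrpc code: the recv_xa field and the
xprt_xa_* helpers are hypothetical, standing in for an xarray keyed by
XID that would replace the recv_queue R-B tree, with the erase moved
out of xprt_complete_rqst() into a colder path such as xprt_release().

/*
 * Sketch only: assumes struct rpc_xprt gained a hypothetical
 * "struct xarray recv_xa" in place of the recv_queue rb-tree.
 */
#include <linux/xarray.h>
#include <linux/sunrpc/xprt.h>

/* Send side: publish the request under its XID. The xarray takes its
 * own internal lock here, so the shared queue_lock is not needed. */
static int xprt_xa_track_rqst(struct rpc_xprt *xprt, struct rpc_rqst *req)
{
	return xa_err(xa_store(&xprt->recv_xa,
			       (unsigned long)be32_to_cpu(req->rq_xid),
			       req, GFP_KERNEL));
}

/* Receive completion: resolve XID -> rqst while holding only the RCU
 * read lock. Caller must hold rcu_read_lock(); a real conversion would
 * also need to pin the request, and free rqsts only after an RCU grace
 * period, before the result is used outside the read section. */
static struct rpc_rqst *xprt_xa_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
{
	return xa_load(&xprt->recv_xa, (unsigned long)be32_to_cpu(xid));
}

/* Teardown: erase the entry in a less performance-sensitive context,
 * e.g. from xprt_release(), rather than in xprt_complete_rqst(). */
static void xprt_xa_untrack_rqst(struct rpc_xprt *xprt, struct rpc_rqst *req)
{
	xa_erase(&xprt->recv_xa, (unsigned long)be32_to_cpu(req->rq_xid));
}

In such a conversion the Receive completion handler would perform the
lookup inside rcu_read_lock()/rcu_read_unlock() and pin the request
(compare xprt_pin_rqst() in the current code) before copying the reply,
so send and receive paths no longer serialize on queue_lock for every
lookup.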