RE: [PATCH 0/5] nfs: Add mount option for forcing RPC requests for one file over one connection

Nagendra Tomar <Nagendra.Tomar@xxxxxxxxxxxxx> · Tue, 23 Mar 2021 16:29:45 +0000

> > On Mar 23, 2021, at 11:57 AM, Nagendra Tomar <Nagendra.Tomar@xxxxxxxxxxxxx> wrote:
> >
> >>> On Mar 23, 2021, at 1:46 AM, Nagendra Tomar <Nagendra.Tomar@xxxxxxxxxxxxx> wrote:
> >>>
> >>> From: Nagendra S Tomar <natomar@xxxxxxxxxxxxx>
> >>>
> >>> If a clustered NFS server is behind an L4 load balancer, the default
> >>> nconnect roundrobin policy may cause RPC requests to a file to be
> >>> sent to different cluster nodes. This is because the source port
> >>> is different for each of the nconnect connections.
> >>> While this should functionally work (since the cluster will usually
> >>> have a consistent view irrespective of which node is serving the
> >>> request), it may not be desirable from a performance point of view.
> >>> As an example, we have an NFSv3 frontend to our object store, where
> >>> every NFSv3 file is an object. If writes to the same file are sent
> >>> roundrobin to different cluster nodes, the writes become very
> >>> inefficient because of the consistency work required when an object
> >>> is updated from different nodes.
> >>> Similarly, each node may maintain some kind of cache to serve file
> >>> data/metadata requests faster, and in that case too it helps to
> >>> have an xprt affinity for a file/dir.
> >>> In general we have seen such a scheme scale very well.
> >>>
> >>> This patch introduces a new rpc_xprt_iter_ops for using an additional
> >>> u32 (filehandle hash) to affine RPCs to the same file to one xprt.
> >>> It adds a new mount option "ncpolicy=roundrobin|hash" which can be
> >>> used to select the nconnect multipath policy for a given mount and
> >>> pass the selected policy to the RPC client.
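
To make the difference between the two policies concrete, here is a rough
userspace model in Python. It is only a sketch, not the patch code: the
names make_roundrobin_picker and hash_picker, the NCONNECT value, and the
use of CRC32 as the u32 hash are all assumptions made for illustration.

```python
import zlib
from itertools import count

NCONNECT = 4  # assumed number of nconnect connections for this mount


def make_roundrobin_picker(nconnect=NCONNECT):
    """Return a picker that cycles through connections, ignoring the file."""
    counter = count()

    def pick(filehandle):
        # The filehandle is deliberately unused: every RPC just takes
        # the next connection in turn.
        return next(counter) % nconnect

    return pick


def hash_picker(filehandle, nconnect=NCONNECT):
    """Pick a connection index from a 32-bit hash of the filehandle bytes."""
    # CRC32 is a stand-in here; the hash used by the patch may differ.
    fh_hash = zlib.crc32(filehandle) & 0xFFFFFFFF
    return fh_hash % nconnect


if __name__ == "__main__":
    fh = b"\x01\x00\x07\x00example-filehandle"  # made-up filehandle bytes
    rr = make_roundrobin_picker()
    print("roundrobin:", [rr(fh) for _ in range(6)])
    print("hash:      ", [hash_picker(fh) for _ in range(6)])
```

With roundrobin, six RPCs for the same file walk through all the
connections. With the hash model, all six land on one connection index,
which is the affinity that ncpolicy=hash is meant to provide.
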
> >>
> >> This sets off my "not another administrative knob that has
> >> to be tested and maintained, and can be abused" allergy.
> >>
> >> Also, my "because connections are shared by mounts of the same
> >> server, all those mounts will all adopt this behavior" rhinitis.
> >
> > Yes, it's fair to call this out, but ncpolicy behaves like the nconnect
> > parameter in this regard.
> >
> >> And my "why add a new feature to a legacy NFS version" hives.
> >>
> >>
> >> I agree that your scenario can and should be addressed somehow.
> >> I'd really rather see this done with pNFS.
> >>
> >> Since you are proposing patches against the upstream NFS client,
> >> I presume all your clients /can/ support NFSv4.1+. It's the NFS
> >> servers that are stuck on NFSv3, correct?
> >
> > Yes.
> >
> >>
> >> The flexfiles layout can handle an NFSv4.1 client and NFSv3 data
> >> servers. In fact it was designed for exactly this kind of mix of
> >> NFS versions.
> >>
> >> No client code change will be necessary -- there are a lot more
> >> clients than servers. The MDS can be made to work smartly in
> >> concert with the load balancer, over time; or it can adopt other
> >> clever strategies.
> >>
> >> IMHO pNFS is the better long-term strategy here.
> >
> > The fundamental difference here is that the clustered NFSv3 server
> > is available over a single virtual IP, so IIUC even if we were to use
> > NFSv4.1 with the flexfiles layout, all it can hand over to the client is
> > that single (load-balanced) virtual IP, and when the clients then connect
> > to the NFSv3 DS we still have the same issue. Am I understanding you right?
> > Can you please elaborate on what you mean by "MDS can be made to work
> > smartly in concert with the load balancer"?
> 
> I had thought there were multiple NFSv3 server targets in play.
> 
> If the load balancer is making them look like a single IP address,
> then take it out of the equation: expose all the NFSv3 servers to
> the clients and let the MDS direct operations to each data server.
> 
> AIUI this is the approach (without the use of NFSv3) taken by
> NetApp next generation clusters.

Yeah, if we could have clients access all the NFSv3 servers then I agree, pNFS
would be a viable option. Unfortunately that's not an option in this case. The
cluster has hundreds of nodes and it's not an on-prem server but a cloud service,
so the simplicity of the single LB VIP is critical.

> 
> >>> It adds a new rpc_procinfo member p_fhhash, which can be supplied
> >>> by the specific RPC programs to return a u32 hash of the file/dir the
> >>> RPC is targeting, and lastly it provides p_fhhash implementations
> >>> for various NFS v3/v4/v4.1/v4.2 RPCs to generate the hash correctly.
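
As an illustration of the per-procedure hook, the following Python model
shows the idea of a procedure table in which each entry may carry a hash
callback. It is a sketch under stated assumptions, not the kernel
interface: ProcInfo, PROCS, target_hash, the argument field names ("fh",
"dir_fh"), the choice of the directory filehandle for LOOKUP, and CRC32 as
the hash are all hypothetical.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProcInfo:
    """Toy analogue of rpc_procinfo with only the fields this sketch needs."""
    name: str
    # Models the proposed p_fhhash member: maps RPC arguments to a u32 hash.
    fhhash: Optional[Callable[[dict], int]] = None


def hash_fh(fh: bytes) -> int:
    """32-bit hash of filehandle bytes (CRC32 is only a stand-in)."""
    return zlib.crc32(fh) & 0xFFFFFFFF


# Hypothetical procedure table: each procedure knows which argument
# identifies the file or directory it targets.
PROCS = {
    "WRITE": ProcInfo("WRITE", lambda args: hash_fh(args["fh"])),
    "LOOKUP": ProcInfo("LOOKUP", lambda args: hash_fh(args["dir_fh"])),
    "NULL": ProcInfo("NULL"),  # no target file, so no hash callback
}


def target_hash(proc_name: str, args: dict) -> Optional[int]:
    """Return the hash for an RPC, or None if the procedure has no callback."""
    proc = PROCS[proc_name]
    if proc.fhhash is None:
        return None
    return proc.fhhash(args)
```

In this model the transport layer never parses NFS arguments itself; it
only asks the procedure for a u32. What happens when a procedure supplies
no callback is left open here (the model just returns None).
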
> >>>
> >>> Thoughts?
> >>>
> >>> Thanks,
> >>> Tomar
> >>>
> >>> Nagendra S Tomar (5):
> >>> SUNRPC: Add a new multipath xprt policy for xprt selection based
> >>>   on target filehandle hash
> >>> SUNRPC/NFSv3/NFSv4: Introduce "enum ncpolicy" to represent the nconnect
> >>>   policy and pass it down from mount option to rpc layer
> >>> SUNRPC/NFSv4: Rename RPC_TASK_NO_ROUND_ROBIN -> RPC_TASK_USE_MAIN_XPRT
> >>> NFSv3: Add hash computation methods for NFSv3 RPCs
> >>> NFSv4: Add hash computation methods for NFSv4/NFSv42 RPCs
> >>>
> >>> fs/nfs/client.c                      |   3 +
> >>> fs/nfs/fs_context.c                  |  26 ++
> >>> fs/nfs/internal.h                    |   2 +
> >>> fs/nfs/nfs3client.c                  |   4 +-
> >>> fs/nfs/nfs3xdr.c                     | 154 +++++++++++
> >>> fs/nfs/nfs42xdr.c                    | 112 ++++++++
> >>> fs/nfs/nfs4client.c                  |  14 +-
> >>> fs/nfs/nfs4proc.c                    |  18 +-
> >>> fs/nfs/nfs4xdr.c                     | 516 ++++++++++++++++++++++++++++++-----
> >>> fs/nfs/super.c                       |   7 +-
> >>> include/linux/nfs_fs_sb.h            |   1 +
> >>> include/linux/sunrpc/clnt.h          |  15 +
> >>> include/linux/sunrpc/sched.h         |   2 +-
> >>> include/linux/sunrpc/xprtmultipath.h |   9 +-
> >>> include/trace/events/sunrpc.h        |   4 +-
> >>> net/sunrpc/clnt.c                    |  38 ++-
> >>> net/sunrpc/xprtmultipath.c           |  91 +++++-
> >>> 17 files changed, 913 insertions(+), 103 deletions(-)
> >>
> >> --
> >> Chuck Lever
> 
> --
> Chuck Lever