> > On Mar 23, 2021, at 11:57 AM, Nagendra Tomar <Nagendra.Tomar@xxxxxxxxxxxxx> wrote:
> >
> >>> On Mar 23, 2021, at 1:46 AM, Nagendra Tomar <Nagendra.Tomar@xxxxxxxxxxxxx> wrote:
> >>>
> >>> From: Nagendra S Tomar <natomar@xxxxxxxxxxxxx>
> >>>
> >>> If a clustered NFS server is behind an L4 loadbalancer the default
> >>> nconnect roundrobin policy may cause RPC requests to a file to be
> >>> sent to different cluster nodes. This is because the source port
> >>> would be different for all the nconnect connections.
> >>> While this should functionally work (since the cluster will usually
> >>> have a consistent view irrespective of which node is serving the
> >>> request), it may not be desirable from performance pov. As an
> >>> example we have an NFSv3 frontend to our Object store, where every
> >>> NFSv3 file is an object. Now if writes to the same file are sent
> >>> roundrobin to different cluster nodes, the writes become very
> >>> inefficient due to the consistency requirement for object update
> >>> being done from different nodes.
> >>> Similarly each node may maintain some kind of cache to serve the file
> >>> data/metadata requests faster and even in that case it helps to have
> >>> a xprt affinity for a file/dir.
> >>> In general we have seen such scheme to scale very well.
> >>>
> >>> This patch introduces a new rpc_xprt_iter_ops for using an additional
> >>> u32 (filehandle hash) to affine RPCs to the same file to one xprt.
> >>> It adds a new mount option "ncpolicy=roundrobin|hash" which can be
> >>> used to select the nconnect multipath policy for a given mount and
> >>> pass the selected policy to the RPC client.
> >>
> >> This sets off my "not another administrative knob that has
> >> to be tested and maintained, and can be abused" allergy.
> >>
> >> Also, my "because connections are shared by mounts of the same
> >> server, all those mounts will all adopt this behavior" rhinitis.
> >
> > Yes, it's fair to call this out, but ncpolicy behaves like the nconnect
> > parameter in this regards.
> >
> >> And my "why add a new feature to a legacy NFS version" hives.
> >>
> >>
> >> I agree that your scenario can and should be addressed somehow.
> >> I'd really rather see this done with pNFS.
> >>
> >> Since you are proposing patches against the upstream NFS client,
> >> I presume all your clients /can/ support NFSv4.1+. It's the NFS
> >> servers that are stuck on NFSv3, correct?
> >
> > Yes.
> >
> >>
> >> The flexfiles layout can handle an NFSv4.1 client and NFSv3 data
> >> servers. In fact it was designed for exactly this kind of mix of
> >> NFS versions.
> >>
> >> No client code change will be necessary -- there are a lot more
> >> clients than servers. The MDS can be made to work smartly in
> >> concert with the load balancer, over time; or it can adopt other
> >> clever strategies.
> >>
> >> IMHO pNFS is the better long-term strategy here.
> >
> > The fundamental difference here is that the clustered NFSv3 server
> > is available over a single virtual IP, so IIUC even if we were to use
> > NFSv41 with flexfiles layout, all it can handover to the client is that single
> > (load-balanced) virtual IP and now when the clients do connect to the
> > NFSv3 DS we still have the same issue. Am I understanding you right?
> > Can you pls elaborate what you mean by "MDS can be made to work
> > smartly in concert with the load balancer"?
>
> I had thought there were multiple NFSv3 server targets in play.
>
> If the load balancer is making them look like a single IP address,
> then take it out of the equation: expose all the NFSv3 servers to
> the clients and let the MDS direct operations to each data server.
>
> AIUI this is the approach (without the use of NFSv3) taken by
> NetApp next generation clusters.

Yeah, if we could have clients access all the NFSv3 servers, then I agree
pNFS would be a viable option. Unfortunately that's not an option in this
case. The cluster has hundreds of nodes, and it's not an on-prem server
but a cloud service, so the simplicity of the single LB VIP is critical.

>
> >>> It adds a new rpc_procinfo member p_fhhash, which can be supplied
> >>> by the specific RPC programs to return a u32 hash of the file/dir the
> >>> RPC is targetting, and lastly it provides p_fhhash implementation
> >>> for various NFS v3/v4/v41/v42 RPCs to generate the hash correctly.
> >>>
> >>> Thoughts?
> >>>
> >>> Thanks,
> >>> Tomar
> >>>
> >>> Nagendra S Tomar (5):
> >>>   SUNRPC: Add a new multipath xprt policy for xprt selection based
> >>>     on target filehandle hash
> >>>   SUNRPC/NFSv3/NFSv4: Introduce "enum ncpolicy" to represent the nconnect
> >>>     policy and pass it down from mount option to rpc layer
> >>>   SUNRPC/NFSv4: Rename RPC_TASK_NO_ROUND_ROBIN -> RPC_TASK_USE_MAIN_XPRT
> >>>   NFSv3: Add hash computation methods for NFSv3 RPCs
> >>>   NFSv4: Add hash computation methods for NFSv4/NFSv42 RPCs
> >>>
> >>>  fs/nfs/client.c                      |   3 +
> >>>  fs/nfs/fs_context.c                  |  26 ++
> >>>  fs/nfs/internal.h                    |   2 +
> >>>  fs/nfs/nfs3client.c                  |   4 +-
> >>>  fs/nfs/nfs3xdr.c                     | 154 +++++++++++
> >>>  fs/nfs/nfs42xdr.c                    | 112 ++++++++
> >>>  fs/nfs/nfs4client.c                  |  14 +-
> >>>  fs/nfs/nfs4proc.c                    |  18 +-
> >>>  fs/nfs/nfs4xdr.c                     | 516 ++++++++++++++++++++++++++++++-----
> >>>  fs/nfs/super.c                       |   7 +-
> >>>  include/linux/nfs_fs_sb.h            |   1 +
> >>>  include/linux/sunrpc/clnt.h          |  15 +
> >>>  include/linux/sunrpc/sched.h         |   2 +-
> >>>  include/linux/sunrpc/xprtmultipath.h |   9 +-
> >>>  include/trace/events/sunrpc.h        |   4 +-
> >>>  net/sunrpc/clnt.c                    |  38 ++-
> >>>  net/sunrpc/xprtmultipath.c           |  91 +++++-
> >>>  17 files changed, 913 insertions(+), 103 deletions(-)
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever
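
For readers following the thread, here is a minimal sketch of the
"ncpolicy=hash" idea described in the quoted cover letter: derive a stable
u32 from the target filehandle and use it to pick one of the nconnect
transports, so every RPC for a given file reuses the same connection and
hence the same source port. This assumes jhash() over the raw filehandle
bytes and a simple modulo over the transport count; pick_xprt_for_fh(),
fh_data, fh_len and nr_xprts are illustrative names, not the identifiers
used in the actual series.

#include <linux/jhash.h>
#include <linux/types.h>

/*
 * Sketch only: map a target filehandle to one of the nconnect
 * transports so that all RPCs for the same file ride the same
 * connection, which an L4 load balancer will keep pinned to a
 * single cluster node.
 */
static unsigned int pick_xprt_for_fh(const void *fh_data,
                                     unsigned int fh_len,
                                     unsigned int nr_xprts)
{
        u32 fhhash = jhash(fh_data, fh_len, 0); /* stable per-file value */

        /* Same filehandle -> same index -> same xprt every time */
        return nr_xprts ? fhhash % nr_xprts : 0;
}

By contrast, the default round-robin policy advances the transport index
on every RPC regardless of target, which is what spreads writes for a
single object across cluster nodes behind the LB VIP.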