On Wed, 2021-03-24 at 09:23 -0400, Tom Talpey wrote:
> On 3/23/2021 12:14 PM, Chuck Lever III wrote:
> > > On Mar 23, 2021, at 11:57 AM, Nagendra Tomar
> > > <Nagendra.Tomar@xxxxxxxxxxxxx> wrote:
> > > 
> > > > On Mar 23, 2021, at 1:46 AM, Nagendra Tomar
> > > > <Nagendra.Tomar@xxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > From: Nagendra S Tomar <natomar@xxxxxxxxxxxxx>
> > > > > 
> > > > > If a clustered NFS server is behind an L4 load balancer, the
> > > > > default nconnect round-robin policy may cause RPC requests to
> > > > > a file to be sent to different cluster nodes, because the
> > > > > source port differs across the nconnect connections. While
> > > > > this should work functionally (the cluster will usually have
> > > > > a consistent view irrespective of which node serves the
> > > > > request), it may not be desirable from a performance point of
> > > > > view. As an example, we have an NFSv3 frontend to our object
> > > > > store, where every NFSv3 file is an object. If writes to the
> > > > > same file are sent round-robin to different cluster nodes,
> > > > > the writes become very inefficient due to the consistency
> > > > > requirement for an object update being driven from different
> > > > > nodes. Similarly, each node may maintain some kind of cache
> > > > > to serve file data/metadata requests faster, and in that case
> > > > > too it helps to have xprt affinity for a file/dir. In general
> > > > > we have seen such a scheme scale very well.
> > > > > 
> > > > > This patch introduces a new rpc_xprt_iter_ops for using an
> > > > > additional u32 (filehandle hash) to affine RPCs to the same
> > > > > file to one xprt. It adds a new mount option
> > > > > "ncpolicy=roundrobin|hash" which can be used to select the
> > > > > nconnect multipath policy for a given mount and pass the
> > > > > selected policy to the RPC client.
> > > > 
> > > > This sets off my "not another administrative knob that has to
> > > > be tested and maintained, and can be abused" allergy.
> > > > 
> > > > Also, my "because connections are shared by mounts of the same
> > > > server, all those mounts will adopt this behavior" rhinitis.
> > > 
> > > Yes, it's fair to call this out, but ncpolicy behaves like the
> > > nconnect parameter in this regard.
> > > 
> > > > And my "why add a new feature to a legacy NFS version" hives.
> > > > 
> > > > I agree that your scenario can and should be addressed somehow.
> > > > I'd really rather see this done with pNFS.
> > > > 
> > > > Since you are proposing patches against the upstream NFS
> > > > client, I presume all your clients /can/ support NFSv4.1+.
> > > > It's the NFS servers that are stuck on NFSv3, correct?
> > > 
> > > Yes.
> > > 
> > > > The flexfiles layout can handle an NFSv4.1 client and NFSv3
> > > > data servers. In fact it was designed for exactly this kind of
> > > > mix of NFS versions.
> > > > 
> > > > No client code change will be necessary -- there are a lot more
> > > > clients than servers. The MDS can be made to work smartly in
> > > > concert with the load balancer, over time; or it can adopt
> > > > other clever strategies.
> > > > 
> > > > IMHO pNFS is the better long-term strategy here.
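As an aside, for anyone skimming the thread: the ncpolicy=hash proposal
in the cover letter above boils down to pinning each filehandle to one
of the nconnect transports. A minimal sketch of the idea (the context
struct and helper here are illustrative, not the actual patch code):

    /*
     * Illustrative sketch only, not the patch code. Given a u32 hash
     * of the target filehandle, always pick the same transport, so
     * all RPCs to one file ride one TCP connection (and therefore
     * reach one node behind the L4 load balancer).
     */
    struct nc_hash_ctx {
            struct rpc_xprt **xprts;   /* the nconnect transports */
            unsigned int nxprts;       /* how many of them */
    };

    static struct rpc_xprt *nc_pick_xprt_by_hash(const struct nc_hash_ctx *ctx,
                                                 u32 fh_hash)
    {
            /* Same fh_hash => same index => same transport. */
            return ctx->xprts[fh_hash % ctx->nxprts];
    }

Any reasonable hash works for this; the only property that matters is
that the filehandle-to-transport mapping stays stable for the lifetime
of the mount.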
> > > The fundamental difference here is that the clustered NFSv3
> > > server is available over a single virtual IP, so IIUC even if we
> > > were to use NFSv4.1 with the flexfiles layout, all it can hand
> > > over to the client is that single (load-balanced) virtual IP,
> > > and when the clients then connect to the NFSv3 DS we still have
> > > the same issue. Am I understanding you right? Can you please
> > > elaborate on what you mean by "MDS can be made to work smartly
> > > in concert with the load balancer"?
> > 
> > I had thought there were multiple NFSv3 server targets in play.
> > 
> > If the load balancer is making them look like a single IP address,
> > then take it out of the equation: expose all the NFSv3 servers to
> > the clients and let the MDS direct operations to each data server.
> > 
> > AIUI this is the approach (without the use of NFSv3) taken by
> > NetApp next generation clusters.
> 
> It certainly sounds like the load balancer is actually performing a
> storage router function here, and round-robin is going to thrash
> that badly. I'm not sure that exposing a magic "hash" knob is a very
> good solution, though. Pushing decisions to the sysadmin is rarely a
> great approach.
> 
> Why not simply argue that "hash" is the better algorithm, and prove
> that it should be the default? Is that not the case?

It's not, no, so we're not making that the default. Pushing all the
I/O to a single file through a single TCP connection only makes sense
if you have a setup like the one Nagendra is describing. Otherwise,
you're better off spreading it across multiple connections (assuming
that you have multiple NICs).

> Tom.

> > > > > It adds a new rpc_procinfo member p_fhhash, which can be
> > > > > supplied by the specific RPC programs to return a u32 hash
> > > > > of the file/dir the RPC is targeting, and lastly it provides
> > > > > a p_fhhash implementation for various NFS v3/v4/v4.1/v4.2
> > > > > RPCs to generate the hash correctly.
> > > > > 
> > > > > Thoughts?
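The p_fhhash hook itself presumably amounts to reducing the opaque
filehandle bytes to a u32, along these lines (a sketch using the
in-kernel struct nfs_fh and jhash(); the helper name is hypothetical):

    #include <linux/jhash.h>
    #include <linux/nfs.h>

    /*
     * Sketch of a p_fhhash-style helper: hash the opaque filehandle
     * bytes down to a u32 that the transport iterator can use for
     * per-file affinity. struct nfs_fh carries the handle size plus
     * its data bytes.
     */
    static u32 nfs_fh_to_hash(const struct nfs_fh *fh)
    {
            return jhash(fh->data, fh->size, 0);
    }

Having each RPC program supply its own hash keeps the transport
iterator free of NFS-version-specific XDR knowledge, which is
presumably why the series touches the v3/v4/v4.1/v4.2 XDR files
individually.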
> > > > > Thanks,
> > > > > Tomar
> > > > > 
> > > > > Nagendra S Tomar (5):
> > > > >   SUNRPC: Add a new multipath xprt policy for xprt selection
> > > > >     based on target filehandle hash
> > > > >   SUNRPC/NFSv3/NFSv4: Introduce "enum ncpolicy" to represent
> > > > >     the nconnect policy and pass it down from mount option to
> > > > >     rpc layer
> > > > >   SUNRPC/NFSv4: Rename RPC_TASK_NO_ROUND_ROBIN ->
> > > > >     RPC_TASK_USE_MAIN_XPRT
> > > > >   NFSv3: Add hash computation methods for NFSv3 RPCs
> > > > >   NFSv4: Add hash computation methods for NFSv4/NFSv42 RPCs
> > > > > 
> > > > >  fs/nfs/client.c                      |   3 +
> > > > >  fs/nfs/fs_context.c                  |  26 ++
> > > > >  fs/nfs/internal.h                    |   2 +
> > > > >  fs/nfs/nfs3client.c                  |   4 +-
> > > > >  fs/nfs/nfs3xdr.c                     | 154 +++++++++++
> > > > >  fs/nfs/nfs42xdr.c                    | 112 ++++++++
> > > > >  fs/nfs/nfs4client.c                  |  14 +-
> > > > >  fs/nfs/nfs4proc.c                    |  18 +-
> > > > >  fs/nfs/nfs4xdr.c                     | 516 ++++++++++++++++++++++++++++++-----
> > > > >  fs/nfs/super.c                       |   7 +-
> > > > >  include/linux/nfs_fs_sb.h            |   1 +
> > > > >  include/linux/sunrpc/clnt.h          |  15 +
> > > > >  include/linux/sunrpc/sched.h         |   2 +-
> > > > >  include/linux/sunrpc/xprtmultipath.h |   9 +-
> > > > >  include/trace/events/sunrpc.h        |   4 +-
> > > > >  net/sunrpc/clnt.c                    |  38 ++-
> > > > >  net/sunrpc/xprtmultipath.c           |  91 +++++-
> > > > >  17 files changed, 913 insertions(+), 103 deletions(-)
> > > > 
> > > > --
> > > > Chuck Lever
> > 
> > --
> > Chuck Lever

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx