> On Jan 19, 2021, at 5:22 PM, bfields@xxxxxxxxxxxx wrote:
>
> On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
>> As far as I can tell, this thread started with a complaint that
>> performance suffers when we don't allow setups that hack the client by
>> pretending that a multi-homed server is actually multiple different
>> servers.
>>
>> AFAICS Tom Talpey's question is the relevant one. Why is there a
>> performance regression being seen by these setups when they share the
>> same connection? Is it really the connection, or is it the fact that
>> they all share the same fixed-slot session?
>>
>> I did see Igor's claim that there is a QoS issue (which afaics would
>> also affect NFSv3), but why do I care about QoS as a per-mountpoint
>> feature?
>
> Sorry for being slow to get back to this.
>
> Some more details:
>
> Say an NFS server exports /data1 and /data2.
>
> A client mounts both. Process 'large' starts creating 10G+ files in
> /data1, queuing up a lot of nfs WRITE rpc_tasks.
>
> Process 'small' creates a lot of small files in /data2, which requires a
> lot of synchronous rpc_tasks, each of which waits in line with the large
> WRITE tasks.
>
> The 'small' process makes painfully slow progress.
>
> The customer previously made things work for them by mounting two
> different server IP addresses, so the "small" and "large" processes
> effectively end up with their own queues.
>
> Frank Sorenson has a test showing the difference; see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
>
> In that test, the "small" process creates files at a rate thousands of
> times slower when the "large" process is also running.
>
> Any suggestions?

Based on observation, there is a bottleneck in svc_recv which fully
serializes the receipt of RPC messages on a TCP socket. Large NFS WRITE
requests take longer to remove from the socket, and only one nfsd can
access that socket at a time. Directing the large operations to a
different socket means one nfsd at a time can service those operations
while other nfsd threads can deal with the burst of small operations.

I don't know of any way to fully address this issue with a socket
transport other than by creating more transport sockets. For RPC/RDMA I
have some patches which enable svc_rdma_recvfrom() to clear XPT_BUSY as
soon as the ingress Receive buffer is dequeued.

--
Chuck Lever
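
For anyone who wants to approximate the two-mount workload described
above without digging into the bugzilla attachments, here is a minimal
sketch. The mount points (/mnt/data1, /mnt/data2), write sizes, and file
counts are illustrative assumptions; this is not Frank Sorenson's actual
test, which is linked in the bugzilla comments.

    #!/usr/bin/env python3
    # Illustrative sketch: one thread streams a large file on one mount
    # (queuing many async NFS WRITE rpc_tasks as dirty pages are flushed)
    # while the main thread creates and fsyncs many small files on the
    # other mount, reporting the small-file creation rate.
    import os
    import time
    import threading

    LARGE_DIR = "/mnt/data1"   # assumed mount of the server's /data1 export
    SMALL_DIR = "/mnt/data2"   # assumed mount of the server's /data2 export

    def large_writer(stop):
        """Stream a big file in 1 MiB chunks until told to stop."""
        buf = b"\0" * (1 << 20)
        with open(os.path.join(LARGE_DIR, "bigfile"), "wb") as f:
            while not stop.is_set():
                f.write(buf)

    def small_creator(n):
        """Create n small files, fsyncing each so every create is synchronous."""
        start = time.time()
        for i in range(n):
            path = os.path.join(SMALL_DIR, "small-%d" % i)
            with open(path, "wb") as f:
                f.write(b"x")
                f.flush()
                os.fsync(f.fileno())
        elapsed = time.time() - start
        print("created %d small files in %.1fs (%.1f files/s)"
              % (n, elapsed, n / elapsed))

    if __name__ == "__main__":
        stop = threading.Event()
        t = threading.Thread(target=large_writer, args=(stop,), daemon=True)
        t.start()
        small_creator(1000)
        stop.set()

Running the small_creator alone, then again with the large_writer
active, should show the kind of rate collapse described above when both
mounts share one TCP connection, and much less impact when each mount
uses its own server IP address (and therefore its own socket).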