> On Jan 19, 2021, at 5:22 PM, bfields@xxxxxxxxxxxx wrote:
>
> On Wed, Oct 07, 2020 at 04:50:26PM +0000, Trond Myklebust wrote:
>> As far as I can tell, this thread started with a complaint that
>> performance suffers when we don't allow setups that hack the client by
>> pretending that a multi-homed server is actually multiple different
>> servers.
>>
>> AFAICS Tom Talpey's question is the relevant one. Why is there a
>> performance regression being seen by these setups when they share the
>> same connection? Is it really the connection, or is it the fact that
>> they all share the same fixed-slot session?
>>
>> I did see Igor's claim that there is a QoS issue (which afaics would
>> also affect NFSv3), but why do I care about QoS as a per-mountpoint
>> feature?
>
> Sorry for being slow to get back to this.
>
> Some more details:
>
> Say an NFS server exports /data1 and /data2.
>
> A client mounts both. Process 'large' starts creating 10G+ files in
> /data1, queuing up a lot of nfs WRITE rpc_tasks.
>
> Process 'small' creates a lot of small files in /data2, which requires a
> lot of synchronous rpc_tasks, each of which waits in line with the large
> WRITE tasks.
>
> The 'small' process makes painfully slow progress.
>
> The customer previously made things work for them by mounting two
> different server IP addresses, so the "small" and "large" processes
> effectively end up with their own queues.
>
> Frank Sorenson has a test showing the difference; see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c42
> https://bugzilla.redhat.com/show_bug.cgi?id=1703850#c43
>
> In that test, the "small" process creates files at a rate thousands of
> times slower when the "large" process is also running.
>
> Any suggestions?

Based on observation, there is a bottleneck in svc_recv which fully
serializes the receipt of RPC messages on a TCP socket. Large NFS WRITE
requests take longer to remove from the socket, and only one nfsd can
access that socket at a time. Directing the large operations to a
different socket means one nfsd at a time can service those operations
while other nfsd threads can deal with the burst of small operations.

I don't know of any way to fully address this issue with a socket
transport other than by creating more transport sockets. For RPC/RDMA I
have some patches which enable svc_rdma_recvfrom() to clear XPT_BUSY as
soon as the ingress Receive buffer is dequeued.

--
Chuck Lever
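
For anyone who wants to approximate the two-mount workload described
above without digging into the bugzilla attachments, here is a minimal
sketch. The mount points (/mnt/data1, /mnt/data2), write sizes, and file
counts are illustrative assumptions; this is not Frank Sorenson's actual
test, which is linked in the bugzilla comments.

    #!/usr/bin/env python3
    # Illustrative sketch: one thread streams a large file on one mount
    # (queuing many async NFS WRITE rpc_tasks as dirty pages are flushed)
    # while the main thread creates and fsyncs many small files on the
    # other mount, reporting the small-file creation rate.
    import os
    import time
    import threading

    LARGE_DIR = "/mnt/data1"   # assumed mount of the server's /data1 export
    SMALL_DIR = "/mnt/data2"   # assumed mount of the server's /data2 export

    def large_writer(stop):
        """Stream a big file in 1 MiB chunks until told to stop."""
        buf = b"\0" * (1 << 20)
        with open(os.path.join(LARGE_DIR, "bigfile"), "wb") as f:
            while not stop.is_set():
                f.write(buf)

    def small_creator(n):
        """Create n small files, fsyncing each so every create is synchronous."""
        start = time.time()
        for i in range(n):
            path = os.path.join(SMALL_DIR, "small-%d" % i)
            with open(path, "wb") as f:
                f.write(b"x")
                f.flush()
                os.fsync(f.fileno())
        elapsed = time.time() - start
        print("created %d small files in %.1fs (%.1f files/s)"
              % (n, elapsed, n / elapsed))

    if __name__ == "__main__":
        stop = threading.Event()
        t = threading.Thread(target=large_writer, args=(stop,), daemon=True)
        t.start()
        small_creator(1000)
        stop.set()

Running the small_creator alone, then again with the large_writer
active, should show the kind of rate collapse described above when both
mounts share one TCP connection, and much less impact when each mount
uses its own server IP address (and therefore its own socket).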