Thanks very much to all of you for the explanations and concrete
suggestions for things to look at. I feel much less stuck!

--b.

On Tue, May 04, 2021 at 02:27:04PM +0000, Trond Myklebust wrote:
> On Tue, 2021-05-04 at 12:08 +1000, NeilBrown wrote:
> > On Tue, 04 May 2021, bfields@xxxxxxxxxxxx wrote:
> > > On Wed, Jan 20, 2021 at 10:07:37AM -0500, bfields@xxxxxxxxxxxx wrote:
> > > >
> > > > So mainly:
> > > >
> > > > > > > Why is there a performance regression being seen by these
> > > > > > > setups when they share the same connection? Is it really the
> > > > > > > connection, or is it the fact that they all share the same
> > > > > > > fixed-slot session?
> > > >
> > > > I don't know. Any pointers how we might go about finding the
> > > > answer?
> > >
> > > I set this aside and then get bugged about it again.
> > >
> > > I apologize, I don't understand what you're asking for here, but it
> > > seemed obvious to you and Tom, so I'm sure the problem is me. Are you
> > > free for a call sometime maybe? Or do you have any suggestions for
> > > how you'd go about investigating this?
> >
> > I think a useful first step would be to understand what is getting in
> > the way of the small requests.
> >  - are they in the client waiting for slots which are all consumed by
> >    large writes?
> >  - are they in TCP stream behind megabytes of writes that need to be
> >    consumed before they can even be seen by the server?
> >  - are they in a socket buffer on the server waiting to be served
> >    while all the nfsd threads are busy handling writes?
> >
> > I cannot see an easy way to measure which it is.
>
> The nfs4_sequence_done tracepoint will give you a running count of the
> highest slot id in use.
>
> The mountstats 'execute time' will give you the time between the
> request being created and the time a reply was received. That time
> includes the time spent waiting for a NFSv4 session slot.
>
> The mountstats 'backlog wait' will tell you the time spent waiting for
> an RPC slot after obtaining the NFSv4 session slot.
>
> The mountstats 'RTT' will give you the time spent waiting for the RPC
> request to be received, processed and replied to by the server.
>
> Finally, the mountstats also tell you average per-op bytes sent/bytes
> received.
>
> IOW: The mountstats really gives you almost all the information you
> need here, particularly if you use it in the 'interval reporting' mode.
> The only thing it does not tell you is whether or not the NFSv4 session
> slot table is full (which is why you want the tracepoint).
>
> > I guess monitoring how much of the time that the client has no free
> > slots might give hints about the first. If there are always free
> > slots, the first case cannot be the problem.
> >
> > With NFSv3, the slot management happened at the RPC layer and there
> > were several queues (RPC_PRIORITY_LOW/NORMAL/HIGH/PRIVILEGED) where
> > requests could wait for a free slot. Since we gained dynamic slot
> > allocation - up to 65536 by default - I wonder if that has much
> > effect any more.
> >
> > For NFSv4.1+ the slot management is at the NFS level. The server sets
> > a maximum which defaults to (maybe is limited to) 1024 by the Linux
> > server. So there are always free rpc slots.
> > The Linux client only has a single queue for each slot table, and I
> > think there is one slot table for the forward channel of a session.
> > So it seems we no longer get any priority management (sync writes used
> > to get priority over async writes).
> >
> > Increasing the number of slots advertised by the server might be
> > interesting. It is unlikely to fix anything, but it might move the
> > bottle-neck.
> >
> > Decreasing the maximum number of tcp slots might also be interesting
> > (below the number of NFS slots at least).
> > That would allow the RPC priority infrastructure to work, and if the
> > large-file writes are async, they might get slowed down.
> >
> > If the problem is in the TCP stream (which is possible if the relevant
> > network buffers are bloated), then you'd really need multiple TCP
> > streams (which can certainly improve throughput in some cases). That
> > is what nconnect gives you. nconnect does minimal balancing. In
> > general it will round-robin, but if the number of requests (not bytes)
> > queued on one socket is below average, that socket is likely to get
> > the next request.
>
> It's not round-robin. Transports are allocated to a new RPC request
> based on a measure of their queue length in order to skip over those
> that show signs of above average congestion.
>
> > So just adding more connections with nconnect is unlikely to help.
> > You would need to add a policy engine (struct rpc_xprt_iter_ops) which
> > reserves some connections for small requests. That should be fairly
> > easy to write a proof-of-concept for.
>
> Ideally we would want to tie into cgroups as the control mechanism so
> that NFS can be treated like any other I/O resource.
>
> >
> > NeilBrown
> >
> >
> > >
> > > Would it be worth experimenting with giving some sort of advantage
> > > to readers? (E.g., reserving a few slots for reads and getattrs and
> > > such?)
> > >
> > > --b.
> > >
> > > > It's easy to test the case of entirely separate state & tcp
> > > > connections.
> > > >
> > > > If we want to test with a shared connection but separate slots I
> > > > guess we'd need to create a separate session for each nfs4_server,
> > > > and a lot of functions that currently take an nfs4_client would
> > > > need to take an nfs4_server?
> > > >
> > > > --b.
> > > >
> >
>
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>
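
For anyone wanting to pull the per-op numbers described above out of
/proc/self/mountstats by hand, here is a minimal Python sketch. It
assumes the per-op line layout is "OP: ops ntrans timeouts bytes_sent
bytes_recv queue_ms rtt_ms execute_ms [errors]" and that the queue field
is what the mountstats tool reports as 'backlog wait'; both are
assumptions worth checking against your kernel. The names OpStats and
parse_op_line are made up for illustration, and the sample line is
invented data, not a measurement.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OpStats:
    """Cumulative counters for one RPC op (hypothetical helper type)."""
    name: str
    ops: int
    bytes_sent: int
    bytes_recv: int
    queue_ms: int      # assumed to correspond to 'backlog wait'
    rtt_ms: int        # assumed to correspond to 'RTT'
    execute_ms: int    # assumed to correspond to 'execute time'

    def averages(self) -> Optional[Dict[str, float]]:
        """Per-op averages, or None if the op was never issued."""
        if self.ops == 0:
            return None
        n = self.ops
        return {
            "avg_queue_ms": self.queue_ms / n,
            "avg_rtt_ms": self.rtt_ms / n,
            "avg_execute_ms": self.execute_ms / n,
            "avg_bytes_sent": self.bytes_sent / n,
            "avg_bytes_recv": self.bytes_recv / n,
        }


def parse_op_line(line: str) -> OpStats:
    """Parse one per-op line under the assumed field layout."""
    name, _, rest = line.strip().partition(":")
    fields = [int(word) for word in rest.split()]
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 counters, got {len(fields)}")
    # ntrans and timeouts are read but not used in this sketch.
    ops, _ntrans, _timeouts, sent, recv, queue, rtt, execute = fields[:8]
    return OpStats(name.strip(), ops, sent, recv, queue, rtt, execute)


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "\tWRITE: 100 100 0 104880000 13600 5000 2000 8000 0"
    stats = parse_op_line(sample)
    print(stats.name, stats.averages())
```

The counters are cumulative since mount, so for something like the
'interval reporting' mode you would take two snapshots and subtract the
fields before averaging.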