On Thu, Nov 26, 2020 at 01:48:23PM +0000, Trond Myklebust wrote:
> On Thu, 2020-11-26 at 12:47 +0200, Dan Aloni wrote:
> > Hi Scott, Trond,
> >
> > Commit ce368536dd614452407dc31e2449eb84681a06af ("nfs: nfs_file_write()
> > should check for writeback errors") seems to have affected NFS v3 soft
> > mount behavior, causing applications to fail on a slow band connection
> > with a properly functioning server. I checked this with recent Linux
> > 5.10-rc5, and on 5.8.18 to where this commit is backported.
> >
> > Question: while the NFS v4 protocol talks about a soft mount timeout
> > behavior at "RFC7530 section 3.1.1" (see reference and patchset
> > addressing it in [1]), is it valid to assume that a similar guarantee
> > for NFS v3 soft mounts is expected?
> >
> > The reason why it is important, is because the fulfilment of this
> > guarantee seemed to have changed with this recent patch.
> >
> > Details on reproduction - using the following mount option:
> >
> > vers=3,rsize=1048576,wsize=1048576,soft,proto=tcp,timeo=50,retrans=16
>
> Sorry, but those are completely silly timeo and retrans values for a
> TCP connection. I see no reason why we should try to support them.

The same issue is reproducible with other values that yield a similar
majortimeo, for example timeo=400,retrans=1.

Looking under `/sys/kernel/debug`, I see RPC tasks that are ready to
transmit accumulating by the thousands. If outgoing throughput is
limited such that the WRITE backlog is larger than what can be
transmitted within the majortimeo window, the tasks end with EIO. This
may sound contrived, but it is achievable with network interfaces of
regular throughput, given enough writers.
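
To make the arithmetic concrete, here is a rough sketch of the comparison
described above. It rests on assumptions that are not stated in this
thread: that the TCP major timeout is timeo (in tenths of a second)
multiplied by (retrans + 1), following the client's linear backoff, and
that it is capped at 600 seconds. The function names, the 100 Mbit/s link
and the task counts are hypothetical and only serve as illustration.

```python
def major_timeout_seconds(timeo, retrans, cap_seconds=600.0):
    """Approximate major timeout for a TCP soft mount.

    Assumption: linear backoff, so the deadline is the initial timeout
    multiplied by (retrans + 1), capped at cap_seconds.
    timeo is expressed in tenths of a second, as in the mount option.
    """
    initial = timeo / 10.0
    return min(initial * (retrans + 1), cap_seconds)


def backlog_drain_seconds(queued_tasks, wsize_bytes, throughput_bytes_per_s):
    """Time needed to push a backlog of full-size WRITEs onto the wire."""
    return queued_tasks * wsize_bytes / throughput_bytes_per_s


def backlog_exceeds_major_timeout(queued_tasks, wsize_bytes,
                                  throughput_bytes_per_s, timeo, retrans):
    """True if the last queued WRITE cannot be sent before the deadline."""
    drain = backlog_drain_seconds(queued_tasks, wsize_bytes,
                                  throughput_bytes_per_s)
    return drain > major_timeout_seconds(timeo, retrans)


if __name__ == "__main__":
    link = 12.5e6  # hypothetical 100 Mbit/s link, in bytes per second
    for tasks in (128, 2000):
        drain = backlog_drain_seconds(tasks, 1048576, link)
        late = backlog_exceeds_major_timeout(tasks, 1048576, link, 50, 16)
        print(f"{tasks} tasks: drain {drain:.1f}s, exceeds majortimeo: {late}")
```

Under these assumptions, timeo=50,retrans=16 and timeo=400,retrans=1 give
85 and 80 seconds respectively, and a backlog of a couple of thousand
1 MiB WRITEs on the hypothetical link takes longer than either to drain.
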

This was not the case prior to Linux v5.1, according to my observation.
With the older sunrpc implementation, these tasks would have waited in
the 'reserved' state, with no timeout calculated for them at all. Tasks
moved to the transmit stage and started counting down to a timeout only
when there was write space on the socket that allowed transmitting them.

I looked around and saw that many vendors recommend changing the
`sunrpc.tcp_max_slot_table_entries` sysctl to 128, down from 65536. This
keeps the transmit queue small instead of letting it grow to tens of
thousands of tasks, and the remaining tasks stay in the backlog without
failing. With the older SunRPC, the 65536 maximum did not matter because
the write space restriction 'naturally' limited the queue.

And indeed, the lower setting fixes the issue I originally raised and
helps to retain the old behavior, where the goal of a soft mount (at
least in my case) is to return EIO for requests that are stuck at the
server rather than at the client.

--
Dan Aloni