On Thu, 2020-11-26 at 08:48 -0500, Trond Myklebust wrote:
> On Thu, 2020-11-26 at 12:47 +0200, Dan Aloni wrote:
> > Hi Scott, Trond,
> >
> > Commit ce368536dd614452407dc31e2449eb84681a06af ("nfs:
> > nfs_file_write() should check for writeback errors") seems to have
> > affected NFS v3 soft mount behavior, causing applications to fail
> > on a low-bandwidth connection to a properly functioning server. I
> > checked this with recent Linux 5.10-rc5, and on 5.8.18, to which
> > this commit was backported.
> >
> > Question: while the NFS v4 protocol describes soft mount timeout
> > behavior in RFC 7530, section 3.1.1 (see reference and the patchset
> > addressing it in [1]), is it valid to assume that a similar
> > guarantee is expected for NFS v3 soft mounts?
> >
> > This matters because the fulfilment of that guarantee appears to
> > have changed with this recent patch.
> >
> > Details on reproduction - using the following mount options:
> >
> >   vers=3,rsize=1048576,wsize=1048576,soft,proto=tcp,timeo=50,retrans=16
>
> Sorry, but those are completely silly timeo and retrans values for a
> TCP connection. I see no reason why we should try to support them.

To clarify _why_ the values make no sense:

timeo=50 means "I expect that all my RPC requests are normally
processed by the server, and a reply will be sent within 5 seconds
whether or not the server is congested". I suggest you look at your
nfsiostat output to see whether that latency expectation is really
warranted (look at the maximum latency values).

retrans=16 means "however, I expect my server to drop RPC requests so
often that some requests need to be retransmitted 16 times in order
to compensate".

Dropping requests is typically rare on a server. It can happen when
the server is congested, but usually that will also cause the server
to drop the connection as well. I suggest you check your nfsstat
output on the server to see if that is really the case.

> >
> > This is done along with rate limiting on the outgoing interface:
> >
> >   tc qdisc add dev eth0 root tbf rate 4000kbit latency 1ms burst 1540
> >
> > And performing the following parallel work on the mountpoint:
> >
> >   for i in `seq 1 100` ; do (dd if=/dev/zero of=x$i &) ; done
> >
> > The result is that EIOs are returned to `dd`, whereas without this
> > commit the IOs simply proceeded slowly and no errors were observed
> > by dd. Traces show that the NFS layer is doing the retries.
> >
> > [1] https://patchwork.kernel.org/project/linux-nfs/cover/20190328205239.29674-1-trond.myklebust@xxxxxxxxxxxxxxx/
>
> Yes. If you artificially create congestion by telling the client to
> keep resending all your outstanding data every 5 seconds, then it is
> trivial to set up this kind of situation. That has always been the
> case, and the patch you point to has nothing to do with this.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
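
For reference, a rough sketch of the worst-case budget those options
imply, assuming the linear backoff that nfs(5) documents for NFS over
TCP (timeo is in tenths of a second, and each retransmission waits one
additional timeo interval, capped at 600 seconds):

  # timeo=50   -> first timeout after 5 s
  # retrans=16 -> up to 17 transmissions before a major timeout
  # cumulative wait per request ~ 5 + 10 + 15 + ... + 85 = 765 s

The client- and server-side statistics suggested above can be pulled
with the stock nfs-utils tools, for example:

  nfsiostat 5        # client: per-mount per-op avg RTT / avg exe latencies
  nfsstat -c         # client: RPC call and retransmission counters
  nfsstat -s         # server: RPC call/badcall counters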