Re: NFSv4: Mounting NFS server which is down, blocks all other NFS mounts on same machine

"Benjamin Coddington" <bcodding@xxxxxxxxxx> · Wed, 09 Jun 2021 10:31:34 -0400

On 9 Jun 2021, at 1:31, Michael Wakabayashi wrote:

Hi Olga,

There seems to be a discrepancy between what you're seeing and what 
we're seeing.

So we were wondering if you could please run these commands in your 
Linux environment and paste the output of the mount command below?
    $ sudo mkdir -p /tmp/mnt.dead
    $ time sudo mount -o vers=4 -vvv 2.2.2.2:/fake_path /tmp/mnt.dead

We'd like the mount command to specifically use "2.2.2.2:/fake_path" 
since we know it is unreachable and outside your subnet.
We're hoping by mounting "2.2.2.2:/fake_path" you'll be able to 
reproduce the same behavior that we're seeing.

Also, if possible, a packet trace would be helpful:
    $ sudo tcpdump -s 0 -w /tmp/nfsv4.pcap port 2049

On my Ubuntu VirtualMachine, I see this output:
    ubuntu@mikes-ubuntu-21-04:~$ time sudo mount -o vers=4 -vvv 2.2.2.2:/fake_path /tmp/mnt.dead
    mount.nfs: timeout set for Wed Jun  9 05:12:15 2021
    mount.nfs: trying text-based options 'vers=4,addr=2.2.2.2,clientaddr=10.162.132.231'
    mount.nfs: mount(2): Connection timed out
    mount.nfs: Connection timed out
    real  3m1.257s
    user  0m0.006s
    sys   0m0.007s

Thanks, Mike

It looks to me like you and Olga are seeing the same thing: the mount 
waits while TCP retransmits the SYN, with the timeout scaling up from 
the initial RTO for each of the tcp_syn_retries attempts.
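
As a rough illustration of that scaling, here is a minimal sketch of the 
arithmetic, assuming the RTO simply doubles after every unanswered SYN 
and that the usual Linux defaults apply (initial RTO of 1 second, 
tcp_syn_retries of 6). The function name syn_total_wait is made up for 
this example. It covers only the TCP-level wait for a single connection 
attempt; the 3m1s in the output above need not match it, since other 
timers may be involved as well.

```shell
#!/bin/sh
# Hypothetical helper: worst-case time one connect() spends retransmitting
# SYNs, assuming the RTO doubles after each unanswered SYN and never hits
# a cap. $1 = initial RTO in seconds, $2 = tcp_syn_retries.
syn_total_wait() {
    rto=$1
    retries=$2
    total=0
    attempt=0
    # One initial SYN plus $retries retransmissions, each followed by a wait.
    while [ "$attempt" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        attempt=$((attempt + 1))
    done
    echo "$total"
}

syn_total_wait 1 6   # 1+2+4+8+16+32+64, prints 127
```

Under the same assumptions, dropping the retry count to 3 would cut the 
wait to 15 seconds (syn_total_wait 1 3).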

It's not disputed that mounts waiting on the transport layer will block 
other mounts.

That behavior might be changeable; there's this torch:
https://lore.kernel.org/linux-nfs/87378omld4.fsf@xxxxxxxxxxxxxxxxxxxxxxxx/

..or there may be another way we don't have to wait ..

.. or tune tcp_syn_retries.. or RTO.. or something else (eBPF?).
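
For the tcp_syn_retries option, a minimal sketch of what that tuning 
could look like on Linux. The value 3 is an arbitrary example, not a 
recommendation, and whether it actually shortens the mount delay seen 
above would need testing.

```shell
# Show the current number of SYN retransmissions.
sysctl net.ipv4.tcp_syn_retries

# Lower it for the running system only (not persistent across reboots).
# Note this is system-wide: it affects every outgoing TCP connection,
# not just NFS mounts.
sudo sysctl -w net.ipv4.tcp_syn_retries=3
```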

I think we're all strapped for time and problems like this usually get 
fixed by the folks feeling the most pain from them.

Ben