Hi Olga,

> > You say that it's taken offline. If it's offline there shouldn't be
> > anything listening on port 2049. I was only able to reproduce the
> > problem when a client is able to send a SYN to the server and not
> > getting a reply back. If the server is offline, there will always be a
> > reply back (RST or something of the sorts). Client tries a bit but it
> > never gets stuck in the rpc_execute() state because it would get a
> > reply from the TCP layer. Your stack is where there is no TCP reply
> > from the server.
>
> I'm actually claiming their infrastructure is broken. He says the
> server is down. If that's the case, the TCP layer will time out fast
> and it will not visibly block other mounts. However, if the server is
> unresponsive, that's what the provided stack shows, then the TCP
> timeout is much larger. I'm saying the server should truly be
> unreachable and not unresponsive.

I don't think the infrastructure is broken. The behavior we're seeing
seems to be correct (explained below).

> If the server is offline, there will always be a reply back
> (RST or something of the sorts).

> He says the server is down. If that's the case, the TCP layer will
> time out fast

I believe both of these statements are incorrect. Let me explain.

If the server is offline (for example, physically unplugged from power),
then the unplugged server cannot reply. It's a brick: CPU, memory, disk,
and NIC all have zero power. With no operating system running, the dead
server never sees the TCP request packet and cannot reply, so the
following cannot be true:

> If the server is offline, there will always be a reply back

Here's what's happening at the TCP layer. I took a TCP packet capture
(see the attached nfsv4.pcap file) and can see the NFS client
(10.162.132.231) attempting a 3-way TCP handshake with the
powered-off/offline server (2.2.2.2). The client sends a TCP SYN to the
NFS server.
But the NFS server is powered off, so the NFS client times out waiting
for the TCP SYN-ACK reply. On timeout, the NFS client retransmits the
TCP SYN and eventually times out again waiting for the SYN-ACK. This
repeats until the TCP retransmits are exhausted. Eventually the NFS
client mount command gives up (after 3 minutes) and exits.

During this 3-minute period, every other NFS mount command on the host
where the NFS client is running is blocked. This is effectively a
denial-of-service attack: no other user could NFS-mount anything,
including perfectly valid NFS mounts. To make matters worse, after the
mount command exited, the workload retried the mount of the powered-off
server, extending the duration of the inadvertent denial of service.

> He says the server is down. If that's the case, the TCP layer will
> time out fast

As described above, the 3-way TCP handshake timeout is relatively slow,
and the mount command takes 3 minutes to exit. I believe you're thinking
of the case where the NFS server is powered on but has no Linux process
listening on NFS port 2049. In that case the NFS server --will-- reply
quickly (within milliseconds) with a TCP RST/Reset packet, and the NFS
client will quickly exit the mount process with an error code.

> There are valid protocol reasons why the NFSv4 client has to check
> whether or not the new mount is really talking to the same server but
> over a different IP addresses.

Hi Trond,

I ran the following script:

#!/bin/sh -x
mount -o vers=4 -v -v -v 2.2.2.2:/fake_path /tmp/mnt.dead &
echo PID_OF_DEAD_BG_MOUNT=$!
sleep 5 # give time for the first mount to execute in the background
mount -o vers=4 -v -v -v 10.188.76.67:/git /tmp/mnt.alive

on Ubuntu 21.04, macOS 11.1, and FreeBSD 11.4. The undesirable blocking
behavior appeared only on Ubuntu. macOS and FreeBSD executed the script
fine, meaning the 10.188.76.67:/git NFS share mounted immediately and
successfully without blocking.
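As a side note, the RST-versus-silence distinction is easy to see
outside of NFS. Here's a minimal bash sketch of my own (not taken from
the attached capture), assuming nothing is listening on 127.0.0.1 port 9
and using 203.0.113.1, a TEST-NET documentation address that should
never answer:

```shell
#!/bin/bash
# Case 1: host is up but the port is closed. The kernel answers the
# SYN with an RST, so the connect attempt fails almost instantly.
if timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/9' 2>/dev/null; then
    echo "port 9: connected (unexpected)"
else
    echo "port 9: refused immediately"
fi

# Case 2: host never answers. The SYN is retransmitted with no reply,
# so the attempt only ends when a timeout fires (capped at 3s here;
# the kernel's own connect timeout is far longer).
timeout 3 bash -c 'exec 3<>/dev/tcp/203.0.113.1/2049' 2>/dev/null \
    && echo "dead host: connected (unexpected)" \
    || echo "dead host: no connection (exit=$?)"
```

The first case is the "powered on but nothing on port 2049" scenario;
the second is the powered-off scenario that blocked the other mounts.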
On Ubuntu, a user running "mount <unreachable-ip-address>:<fake-path>"
blocks mounts for every other user on the system, which is undesirable.
Our virtual machines basically crashed because we had several hundred to
several thousand blocked mount processes preventing workloads from
making progress.

We'd prefer the mount behavior of the macOS or FreeBSD implementations,
even if it's less secure, since at least it doesn't take down our
servers with an unintentional DoS attack. Is there any chance of looking
at the FreeBSD mount implementation, seeing whether it is correct, and
if so, having the Linux mount command emulate that behavior?

Thanks,
Mike

p.s. I've attached the following 4 files, which were generated on
Ubuntu 21.04:

1. mount_hang.sh: script running the mount test
2. mount_hang.sh.log: output of mount_hang.sh
3. trace_pipe.txt: output of:
   cat /sys/kernel/debug/tracing/trace_pipe > trace_pipe.txt
   with events/sunrpc/enable set to 1 and events/nfs4/enable set to 1
4. nfsv4.pcap: output of "tcpdump -v -v -v -v -s 0 -w /tmp/nfsv4.pcap port 2049"

The test procedure was:
- run mount_hang.sh on the Ubuntu 21.04 VM
- after seeing the second mount command execute, wait about 5 seconds
- press control-C to exit the mount_hang.sh script
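For completeness, the trace_pipe.txt capture listed in the p.s. was
gathered roughly as sketched below (standard tracefs paths under
/sys/kernel/debug on Ubuntu; must be run as root, and the debugfs mount
point may differ on other distros):

```shell
#!/bin/sh
# Sketch of the ftrace setup behind trace_pipe.txt (run as root).
cd /sys/kernel/debug/tracing || exit 1
echo 1 > events/sunrpc/enable   # enable sunrpc tracepoints
echo 1 > events/nfs4/enable     # enable nfs4 tracepoints
# Stream events while reproducing the hang; stop with Ctrl-C.
cat trace_pipe > /tmp/trace_pipe.txt
```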
#!/bin/sh -x
mount -o vers=4 -v -v -v 2.2.2.2:/fake_path /tmp/mnt.dead &
echo PID_OF_DEAD_BG_MOUNT=$!
sleep 5
mount -o vers=4 -v -v -v 10.188.76.67:/git /tmp/mnt.alive
Attachment: mount_hang.sh.log
Attachment: nfsv4.pcap
<...>-714884 [001] .... 1639365.962466: xprt_create: peer=[2.2.2.2]:2049 state=BOUND
<...>-714884 [001] .N.. 1639365.962524: rpc_clnt_new_err: program=nfs server=2.2.2.2 error=-22
<...>-701148 [001] .... 1639365.962530: xprt_destroy: peer=[2.2.2.2]:2049 state=LOCKED|BOUND
<...>-714884 [001] .... 1639365.962543: xprt_create: peer=[2.2.2.2]:2049 state=BOUND
<...>-714884 [001] .... 1639365.962555: rpc_clnt_new: client=0 peer=[2.2.2.2]:2049 program=nfs server=2.2.2.2
<...>-714884 [001] .... 1639365.962558: rpc_task_begin: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=ACTIVE status=0 action=0x0
<...>-714884 [001] .... 1639365.962559: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_start [sunrpc]
<...>-714884 [001] .... 1639365.962560: rpc_request: task:3239@0 nfsv4 NULL (sync)
<...>-714884 [001] .... 1639365.962560: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserve [sunrpc]
<...>-714884 [001] .... 1639365.962562: xprt_reserve: task:3239@0 xid=0xec976bbb
<...>-714884 [001] .... 1639365.962562: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_reserveresult [sunrpc]
<...>-714884 [001] .... 1639365.962562: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_refresh [sunrpc]
<...>-714884 [001] .... 1639365.962564: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_refreshresult [sunrpc]
<...>-714884 [001] .... 1639365.962564: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_allocate [sunrpc]
<...>-714884 [001] .... 1639365.962566: rpc_buf_alloc: task:3239@0 callsize=56 recvsize=32 status=0
<...>-714884 [001] .... 1639365.962566: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE status=0 action=call_encode [sunrpc]
<...>-714884 [001] .... 1639365.962569: rpc_task_run_action: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect [sunrpc]
<...>-714884 [001] .... 1639365.962569: xprt_reserve_xprt: task:3239@0 snd_task:3239
<...>-714884 [001] .... 1639365.962570: xprt_connect: peer=[2.2.2.2]:2049 state=LOCKED|BOUND
<...>-714884 [001] .... 1639365.962571: rpc_task_sleep: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=RUNNING|ACTIVE|NEED_XMIT|NEED_RECV status=0 timeout=0 queue=xprt_pending
<...>-714884 [001] .N.. 1639365.962576: rpc_task_sync_sleep: task:3239@0 flags=NULLCREDS|DYNAMIC|SOFT|SOFTCONN|CRED_NOREF runstate=QUEUED|ACTIVE|NEED_XMIT|NEED_RECV status=0 action=call_connect_status [sunrpc]
kworker/u5:1-3258 [001] .... 1639365.962618: rpc_socket_connect: error=-115 socket:[2506590] dstaddr=2.2.2.2/2049 state=2 (CONNECTING) sk_state=2 (SYN_SENT)
kworker/u5:1-3258 [001] .... 1639365.962619: xprt_release_xprt: task:4294967295@4294967295 snd_task:4294967295