Recently a customer reported a regression which was the result of a small portion of 9b30889c548a which changed the handling of TCP_CLOSE inside xs_tcp_state_change. A simple reproducer was shown where an unexpected RST injected during a simple write test would cause the NFS client to hang, a kworker thread spinning at 100% cpu, and the only recovery was to reboot. Upon investigation we found that the XPRT_CONNECTED bit was never cleared on the rpc_xprt and as a result, rpc_tasks would loop around with the following rpcdebug, which shows the rpc_xprt still has the XPRT_CONNECTED bit set even though EPIPE (32) is coming back from the TCP layer: kernel: RPC: 92 xprt_transmit(524516) kernel: RPC: xs_tcp_send_request(524516) = -32 kernel: RPC: 92 call_status (status -32) kernel: RPC: 92 call_bind (status 0) kernel: RPC: 92 call_connect xprt ffff8ff8abf44000 is connected kernel: RPC: 92 call_transmit (status 0) kernel: RPC: 92 xprt_prepare_transmit kernel: RPC: 92 rpc_xdr_encode (status 0) kernel: RPC: 92 marshaling UNIX cred ffff8ff835db5000 kernel: RPC: 92 using AUTH_UNIX cred ffff8ff835db5000 to wrap rpc data kernel: RPC: 92 xprt_transmit(524516) kernel: RPC: xs_tcp_send_request(524516) = -32 The explanation of the above is as follows. Prior to 9b30889c548a, XPRT_CONNECTED was unconditionally cleared while handling TCP_CLOSE inside xs_tcp_state_change through the following call chain: xs_sock_mark_closed -> xprt_disconnect_done After the patch, while handling TCP_CLOSE, we have the following chain: xs_tcp_force_close -> xprt_force_disconnect The latter will queue a work xprt_autoclose based on the XPRT_LOCKED bit. If the xprt is locked at this point (for example due to another rpc_task in the middle of transmission), the work won't get queued and the XPRT_CONNECTED bit never gets cleared. The below reproducer will trigger the problem on 4.16 - 4.19 and the 1-line patch that follows fixes the problem. So far I have not been able to reproduce the problem on 4.20-rc6. However, the patch makes handling of TCP_CLOSE consistent with that of TCP_CLOSE_WAIT inside xs_tcp_state_change. On the basis of consistency / correctness, and given the impact to 4.16 - 4.19, I submit the patch that follows for consideration for upstream as well as stable branches of 4.16 - 4.19. #!/bin/bash # Reproduces NFS client hang due to sunrpc state machine spinning in # transmit side due to incorrect handling of XPRT_CONNECTED in presence # of unsolicited RST. # Tested (fails) on: 4.16, 4.17, 4.18, 4.19, 3.10.0-957.el7 # # Author: Frank Sorenson <sorenson@xxxxxxxxxx> # From: Deependra Shekhawat <deepens@xxxxxxxxxx> # Modifications: Dave Wysochanski <dwysocha@xxxxxxxxxx> # install_dependency() { /usr/bin/which $1 > /dev/null 2>&1 [ $? -ne 0 ] && echo "Installing dependency $1" && yum install -y $2 } install_dependency tshark wireshark install_dependency screen screen exports_dir=/exports2 nfsmp=/mnt/tmp blocked_sleep_time=10 test_sleep_time1=10 test_sleep_time2=10 LOG_OVER_NET=1 # whether to log by sending to udp/514 (syslog) on HOST1 (either # enable remote syslog, or just add 'or udp port 514' to the packet # capture filter. Goes to /dev/null, if set to 0. # pcap_pid="" if [[ $LOG_OVER_NET -ne 0 ]] ; then # probably best to be the network address of some real network, but not a real host address # FIXME: un-comment the following in /etc/rsyslog.conf #$ModLoad imudp #$UDPServerRun 514 systemctl start rsyslog exec 6<>/dev/udp/127.0.0.1/514 else exec 6<>/dev/null fi trap exit_cleanup EXIT exit_cleanup() { iptables -D OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset [[ -n $pcap_pid ]] && kill $pcap_pid screen -S dd_loop -X kill >/dev/null 2>&1 rpcdebug -m rpc -c all >/dev/null rpcdebug -m nfs -c all >/dev/null exec 6>&- } output() { echo -en "$*" # echo -en "$0: $*" | tr -d '\n\t' | tee >(logger -t nfs_hang_test) >&6 echo -en "$0: $*" | tr -d '\n\t' | logger -t nfs_hang_test echo -en "$0: $*" | tr -d '\n\t' >&6 } tshark -i lo -Qw /tmp/trace.pcap tcp port 2049 or udp port 514 & pcap_pid=$! systemctl start nfs-server sleep 2 output "performing setup and mount\n" rpcdebug -m rpc -s all rpcdebug -m nfs -s all # Either add nfs to firewall or use 'insecure' on mount and 'noresvport' on mount firewall-cmd --permanent --add-service=nfs systemctl restart firewalld mkdir -p $exports_dir exportfs -o rw,sync,no_root_squash 127.0.0.1:$exports_dir mkdir -p $nfsmp mount -overs=4.1 127.0.0.1:$exports_dir $nfsmp mkdir -p $nfsmp$exports_dir/dir output "starting dd loop\n" sleep 1 screen -d -m -S dd_loop /bin/bash -c "cd $nfsmp ; while true ; do date ; time dd if=/dev/zero oflag=direct of=$nfsmp$exports_dir/dir/file bs=16M count=100 2>/dev/null ; done" output "dd loop started\n" sleep 2 output "enabling firewall rule\n" iptables -A OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset output "firewall rule added; sleeping $blocked_sleep_time\n" sleep $blocked_sleep_time output "removing firewall rule\n" iptables -D OUTPUT -p tcp -m tcp --dport 2049 -j REJECT --reject-with tcp-reset output "firewall rule removed\n" output "sleeping $test_sleep_time1\n" sleep $test_sleep_time1 output "checking response\n" timeout -k 10 10 /bin/ls $nfsmp$exports_dir/dir/file >/dev/null 2>&1 res=$? if [[ $res -eq 0 ]] ; then output "OK on `uname -r`: ls appears okay\n" else output "WARNING on `uname -r`: ls appears hung\n" fi output "killing dd loop\n" screen -S dd_loop -X kill output "sleeping $test_sleep_time2\n" sleep $test_sleep_time2 output "killing pcap\n" kill $pcap_pid output "stopping rpcdebug\n" rpcdebug -m rpc -c all rpcdebug -m nfs -c all output "checking umount\n" timeout -k 10 10 umount $nfsmp >/dev/null 2>&1 res=$? if [[ $res -eq 0 ]] ; then output "TEST PASS on `uname -r`: umount ok\n" else output "TEST FAIL on `uname -r`: umount hung\n" fi output "test complete\n"