On 04/02/2016 10:58 PM, NeilBrown wrote:
> On Sun, Feb 14 2016, Richard Laager wrote:
>
>> [1.] One line summary of the problem:
>>
>> NFS Client Ignores TCP Resets
>>
>> [2.] Full description of the problem/report:
>>
>> Steps to reproduce:
>> 1) Mount NFS share from HA cluster with TCP.
>> 2) Failover the HA cluster. (The NFS server's IP address moves from one
>>    machine to the other.)
>> 3) Access the mounted NFS share from the client (an `ls` is sufficient).
>>
>> Expected results:
>> Accessing the NFS mount works fine immediately.
>>
>> Actual results:
>> Accessing the NFS mount hangs for 5 minutes. Then the TCP connection
>> times out, a new connection is established, and it works fine again.
>>
>> After the IP moves, the new server responds to the client with TCP RST
>> packets, just as I would expect. I would expect the client to tear down
>> its TCP connection immediately and re-establish a new one. But it
>> doesn't. Am I confused, or is this a bug?
>>
>> For the duration of this test, all iptables firewalling was disabled on
>> the client machine.
>>
>> I have a packet capture of a minimized test (just a simple ls):
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1542826/+attachment/4571304/+files/dovecot-test.upstream-kernel.pcap
>
> I notice that the server sends packets from a different MAC address to
> the one it advertises in ARP replies (and the one the client sends to).
> This is probably normal - maybe you have two interfaces bonded together?
>
> Maybe it would help to be explicit about the network configuration
> between client and server - are there switches? soft or hard?
>
> Where is tcpdump being run? On the (virtual) client, or on the
> (physical) host or elsewhere?

Yes, there is link bonding happening on both sides. Details below.

This test was run from a VM (for testing purposes), but the problem is
equally reproducible on just the host, with or without this VLAN
attached to a bridge. That is, whether we put the NFS client IP on
bond0 (with no br9 existing) or on br9, we get the same behavior using
NFS from the host. I believe I was running the packet capture from
inside the VM.

+------------------------------+
|             Host             |
|                              |
|  +------+                    |
|  |  VM  |                    |
|  |      |                    |
|  | eth0 |                    |
|  +------+                    |
|     |       VM's eth0        |
|     |       is e.g.          |
|     |       vnet0 on         |
|     |       the host         |
|     |                        |
| TCP/IP -------+ br9          |
| Stack         |              |
|               |              |
|               |              |
|             bond0            |
|       +-------+------+       |
| p5p1  |              | p6p1  |
|       |              |       |
+-------|              |-------+
        | 10GbE        | 10GbE
   +----------+    +----------+
   | Switch 1 |20Gb| Switch 2 |
   |          |====|          |
   +----------+    +----------+
        | 10GbE        | 10GbE
+-------|              |-------+
|       |              |       |
| oce0  |              | oce1  |
|       +-------+------+       |
|               | ipmp0        |
|               |              |
| TCP/IP -------+              |
| Stack                        |
|                              |
|         Storage Head         |
+------------------------------+

The switches behave like a single, larger virtual switch. The VM host
is doing actual 802.3ad LAG, whereas the storage heads are doing
Solaris's link-based IPMP.

There are two storage heads, each with two physical interfaces:

krls1:
    oce0: 00:90:fa:34:f3:be
    oce1: 00:90:fa:34:f3:c2
krls2:
    oce0: 00:90:fa:34:f3:3e
    oce1: 00:90:fa:34:f3:42

The failover event in the original packet capture was failing over from
krls1 to krls2.

...

> If you were up to building your own kernel, I would suggest putting some
> printks in tcp_validate_incoming() (in net/ipv4/tcp_input.c).
>
> Print a message if th->rst is ever set, and another if the
> tcp_sequence() test causes it to be discarded. It shouldn't but
> something seems to be discarding it somewhere...
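(For anyone following along: tcp_sequence() is the in-window
acceptability check in net/ipv4/tcp_input.c. In the tree I'm testing it
is essentially the following; the explanatory comment is mine:

/* A segment is acceptable if it does not end before the last window
 * update we ACKed (rcv_wup) and does not start beyond the right edge
 * of the advertised receive window.
 */
static inline bool tcp_sequence(const struct tcp_sock *tp, u32 seq, u32 end_seq)
{
	return	!before(end_seq, tp->rcv_wup) &&
		!after(seq, tp->rcv_nxt + tcp_receive_window(tp));
}

The two before()/after() tests in my patch below mirror those two
conditions, so the printks show which one rejected the segment.)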
I added the changes you suggested:

--- tcp_input.c.orig	2016-04-07 04:11:07.907669997 -0500
+++ tcp_input.c	2016-04-04 19:41:09.661590000 -0500
@@ -5133,6 +5133,11 @@
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (th->rst)
+	{
+		printk(KERN_WARNING "Received RST segment.\n");
+	}
+
 	/* RFC1323: H1. Apply PAWS check first. */
 	if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
 	    tcp_paws_discard(sk, skb)) {
@@ -5163,6 +5168,20 @@
 				&tp->last_oow_ack_time))
 			tcp_send_dupack(sk, skb);
 	}
+	if (th->rst)
+	{
+		printk(KERN_WARNING "Discarding RST segment due to tcp_sequence()\n");
+		if (before(TCP_SKB_CB(skb)->end_seq, tp->rcv_wup))
+		{
+			printk(KERN_WARNING "RST segment failed before test: %u %u\n",
+			       TCP_SKB_CB(skb)->end_seq, tp->rcv_wup);
+		}
+		if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
+		{
+			printk(KERN_WARNING "RST segment failed after test: %u %u %u\n",
+			       TCP_SKB_CB(skb)->seq, tp->rcv_nxt, tcp_receive_window(tp));
+		}
+	}
 	goto discard;
 }
 
@@ -5174,10 +5193,13 @@
 	 * else
 	 *     Send a challenge ACK
 	 */
-	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt)
+	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		printk(KERN_WARNING "Accepted RST segment\n");
 		tcp_reset(sk);
-	else
+	} else {
+		printk(KERN_WARNING "Sending challenge ACK for RST segment\n");
 		tcp_send_challenge_ack(sk, skb);
+	}
 	goto discard;
 }

...reordered quoted text...

> Can you create a TCP connection to some other port on the server
> (telnet? ssh? http?) and see what happens to it on fail-over?
> You would need some protocol that the server won't quickly close.
> Maybe just "telnet SERVER 2049" and don't type anything until after the
> failover.
>
> If that closes quickly, then maybe it is an NFS bug. If that persists
> for a long timeout before closing, then it must be a network bug -
> either in the network code or the network hardware.
> In that case, netdev@xxxxxxxxxxxxxxx might be the best place to ask.

I tried "telnet 10.20.0.30 22" and got the SSH banner. I sent no input,
forced a storage cluster failover, and then hit Enter after the
failover was complete. The SSH connection terminated immediately. My
tcp_validate_incoming() debugging code showed "Received RST segment."
and "Accepted RST segment", as expected. These correspond to the one
RST packet I received on the SSH connection.

In a separate failover event, I tested accessing NFS over TCP. There, I
do *not* see "Received RST segment." at all, so I conclude that
tcp_validate_incoming() is never called for the RSTs on the NFS
connection. Any thoughts on what that means or where to go from here?

-- 
Richard
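P.S. Unless someone has a better idea, my next step is to push the same
instrumentation one layer down, into tcp_v4_rcv() in
net/ipv4/tcp_ipv4.c, to see whether the RSTs reach TCP at all. Something
like this untested sketch; the exact context lines will differ between
kernel versions:

--- tcp_ipv4.c.orig
+++ tcp_ipv4.c
@@ ... @@ int tcp_v4_rcv(struct sk_buff *skb)
 	th = tcp_hdr(skb);
 
+	/* Debug: log every RST that reaches TCP at all, before the
+	 * checksum check and the socket lookup can discard it.
+	 */
+	if (th->rst)
+		printk(KERN_WARNING "tcp_v4_rcv: RST %pI4:%u -> %pI4:%u\n",
+		       &ip_hdr(skb)->saddr, ntohs(th->source),
+		       &ip_hdr(skb)->daddr, ntohs(th->dest));
+
 	if (th->doff < sizeof(struct tcphdr) / 4)
 		goto bad_packet;

If the RSTs show up there but never in tcp_validate_incoming(), then
something in between (the checksum check, the XFRM policy check, or the
socket lookup) is eating them.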