On Fri, 2019-02-22 at 14:17 -0500, Dave Wysochanski wrote:
> On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> > On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@xxxxxxxxxm> wrote:
> > > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > > Hi Dave,
> > > >
> > > > A reproducer is a server that sends an ACK to the client's FIN/ACK
> > > > and does nothing afterwards (I can reproduce it 100% with a hacked-up
> > > > server; it was discovered with a "broken" server that doesn't fully
> > > > close a connection). It leaves this client unable to connect to this
> > > > server again (an indefinite, reboot-required kind of state). At one
> > > > point it was considered that doing something like that to the client
> > > > is a form of attack (denial-of-service), and thus the kernel has a
> > > > tcp_fin_timeout option after which the kernel will abort the
> > > > connection. However, that only applies to sockets that have been
> > > > closed by the client. That is NOT the case here: NFS does not close
> > > > the connection, and it ignores the kernel's notification of the
> > > > FIN_WAIT2 state.
> > > >
> > > Interesting. I had the same reproducer, but eventually an RST would
> > > get sent from the NFS client due to the TCP keepalives. It sounds like
> > > that is not the case anymore. My testing was over a year and a half
> > > ago, though.
> >
> > After the keepalives, TCP also sends a FIN/ACK; if the server properly
> > sent a FIN/ACK back, then keepalive will indeed be sufficient. Can you
> > check whether in your trace the server in one case sent just an ACK but
> > in another case sent the FIN/ACK?
> >
>
> I had two reproducers actually. In both cases the NFS server would
> never send a FIN.
>
> The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421) was
> about an NFS server crash after being in the half-close state. An NFSv4
> client without that keepalive patch would hang. This is a valid test
> case and we check for it now in all releases. Here's the outline:
>
> # Outline
> # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> # Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
> # Step 3. On NFS server, block outgoing FIN packets from the NFS server
> # Step 4. On NFS server, using a systemtap script, delay the NFS4OPEN_CONFIRM response for a few seconds
> # Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
> # Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
> # Step 7. On NFS client, wait for the NFS server to reboot
> # Step 8. On NFS client, wait for NFS to reconnect the TCP connection
> # Step 9. On NFS client, wait for the NFS4 grace period
> # Step 10. On NFS client, try an 'ls' of the file created earlier
> # Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
> # Step 12. On NFS client, ensure all operations have completed
> # Step 13. Cleanup
>
> The second test case (after the keepalive patch) was arguably an invalid
> test, as I used systemtap on the NFS server to never close the socket so
> it would never send a FIN. This obviously should never happen, but at
> the time I wanted to see if the NFS client would recover.
> I did that with this:
>
> probe module("sunrpc").function("svc_xprt_enqueue")
> {
>     printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
>            tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
>     $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
> }
>
> Here's the full outline for this test:
>
> # This test checks automatic NFS TCP recovery after a specific failure scenario.
> # The failure scenario is an NFSv3 mount that goes idle, the NFS client closes the
> # TCP connection, and the final FIN from the NFS server gets lost.
> # This can be a tricky scenario since the NFS client's TCP may get stuck in FIN-WAIT-2 indefinitely.
> # The NFS client should be able to recover the NFS TCP connection automatically without an
> # unmount/mount cycle or reboot, and the TCP connection should not be stuck in FIN-WAIT-2 or
> # any other non-connected state.
> #
> # Outline
> # Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
> # Step 2. On NFS client, mount NFS3 server export
> # Step 3. On NFS server, using a systemtap script, force the NFS server to never send a FIN (avoid close)
> # Step 4. On NFS client, verify the NFS mount with 'ls' and the TCP connection in TCP_ESTABLISHED
> # Step 5. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to transition to TIME_WAIT state
> # Step 6. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to disappear
> # Step 7. Cleanup
> #
> # PASS: The NFS TCP connection is usable by the end of the test (commands don't hang, etc.)
> # FAIL: The NFS TCP connection remains in FIN-WAIT-2 indefinitely
>
> > > > One can argue that this is a broken server and we shouldn't
> > > > bother. But this patch is an attempt to argue that the client
> > > > still should care and deal with this condition. However, if the
> > > > community feels that a broken server is a broken server and this
> > > > form of attack is not of interest, this patch can live in the
> > > > archive for later, or never.
> > > >
> > > This isn't IPoIB, is it?
> >
> > No, just normal TCP.
> >
> > > Actually, fwiw, looking back I had speculated on changes in this
> > > area. I'm adding you to the CC list of this bug, which had some of
> > > my musings on it:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > > That bug I ended up closing when we could no longer prove there was
> > > any state where the NFS client could get stuck in FIN_WAIT2 after
> > > the keepalive patch.
> >
> > I can reproduce the current behavior with the current upstream code.
> >
> > > It can happen that the server only sends the ACK back to the
> > > client's FIN,ACK, so in general the test case is valid. But then the
> > > question is how long should one wait for the final data and FIN from
> > > the server, or are there ever instances where you shouldn't wait
> > > forever? Is there a way for us to know for sure there is no data
> > > left to receive so it's safe to time out? No RPCs outstanding?
> >
> > Yep, all good questions which I don't have answers to. I realize, for
> > instance, that a server might send an ACK and then send a FIN/ACK
> > after that. But why is it bad for the client to proactively send an
> > RST (by shutting down the connection in TCP_FIN_WAIT2) if the client
> > was shutting down the connection anyway?
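To make the state in question concrete, here is a minimal, illustrative userspace sketch of the half-close scenario discussed above (the server address and port are placeholders, not taken from this thread). The client sends its FIN via shutdown(SHUT_WR) but keeps the descriptor open, mirroring what the NFS transport does; if the peer only ACKs that FIN, the socket sits in FIN-WAIT-2, and net.ipv4.tcp_fin_timeout never applies because that timer only covers orphaned sockets the application has already close()d.

/*
 * Illustrative only: a userspace sketch of the half-close state under
 * discussion.  The address and port below are placeholders.
 */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(2049);                      /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* placeholder server */

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* Half-close: our FIN goes out, but the descriptor stays open. */
    shutdown(fd, SHUT_WR);

    /*
     * While we sleep with the fd still open, "ss -tan state fin-wait-2"
     * shows this socket if the peer ACKed our FIN without sending its
     * own.  Because the socket is not orphaned, tcp_fin_timeout does
     * not tear it down.
     */
    sleep(600);

    close(fd);
    return 0;
}

While the descriptor stays open, the only exits from FIN-WAIT-2 are the peer finally sending its FIN or an RST, the client closing or aborting the socket itself, or keepalive probes failing, which is why the question above about proactively resetting matters.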
>
> I think you could lose data from the server if you RST and don't wait,
> but other than that I don't know. We may be splitting hairs here,
> though, if there's no demonstrable harm to the application (NFS). As
> Trond points out elsewhere, reconnect / port reuse may be an issue.
>
> When I looked at this in the second bug I was almost convinced that
> there was a problem with the 'close' method in xs_tcp_ops - we needed
> to be calling xs_close() - but I ran into some problems and I wasn't
> sure about the test case.
>
> > > I don't claim to know many of the subtleties here, as far as whether
> > > the server would wait forever in LAST_ACK or implementations
> > > eventually time out after some time. It seems like if these last
> > > packets get lost there is likely a bug somewhere (either firewall or
> > > TCP stack, etc).
> > > https://tools.ietf.org/html/rfc793#page-22
> > >
> > > It looks like at least some people are putting timeouts into their
> > > stacks, though I'm not sure whether that's a good idea or not.
> >
> > I saw only one other driver in the kernel that does have handling of
> > the TCP_FIN_WAIT2 state (driver/crypto/chelsio) ...
>
> One final thought - is it possible that we could have a network that
> does not support TCP keepalives? In that instance, I'll bet we could
> get an indefinite hang in FIN_WAIT2 in my first test case (server
> crashed after the half-close state).
>
> It does not look like TCP keepalives have to be implemented, and maybe
> some routers / NATs / firewalls disallow them (due to extra traffic)?
> https://tools.ietf.org/html/rfc1122#page-101

Again, that's a corner case of a corner case. I've never seen any
reports of this occurring. On the other hand, I _have_ seen lots of
reports of problems due to the reconnection issues when the server
still holds the socket open. Those reports were the main reason for
the current design.

-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space
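For reference on the keepalive point raised above: TCP keepalive is opt-in and tunable per socket, and the probes travel on the established connection itself, so a middlebox that drops them is the corner case being weighed against the reconnect problems described in the reply. Below is a minimal userspace sketch of the per-socket knobs; the values are arbitrary examples, and the in-kernel RPC client sets the equivalent options on its transport socket rather than using this API.

/*
 * Illustrative only: enable and tune TCP keepalive on one socket.
 * The numbers are arbitrary examples, not values from the thread.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enable_keepalive(int fd)
{
    int on = 1;
    int idle = 60;   /* seconds of idle time before the first probe */
    int intvl = 10;  /* seconds between probes */
    int cnt = 6;     /* unanswered probes before the connection is aborted */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}

When the configured number of probes goes unanswered, the kernel aborts the connection, which is the RST-after-keepalives behaviour mentioned earlier in the thread; if the probes never make it through, that backstop is lost.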