On Fri, 2019-02-22 at 12:10 -0500, Olga Kornievskaia wrote:
> On Fri, Feb 22, 2019 at 11:32 AM Dave Wysochanski <dwysocha@xxxxxxxxxx> wrote:
> > On Fri, 2019-02-22 at 09:44 -0500, Olga Kornievskaia wrote:
> > > Hi Dave,
> > >
> > > A re-producer is a server that sends an ACK to the client's FIN/ACK
> > > request and does nothing afterwards (I can reproduce it 100% with a
> > > hacked up server. It was discovered with a "broken" server that
> > > doesn't fully close a connection). It leaves the client unable to
> > > connect to this server again in an indefinitely/forever/reboot-required
> > > kind of state. At some point it was considered that doing something
> > > like that to the client is a form of an attack (denial-of-service),
> > > and thus the kernel has a tcp_fin_timeout option after which the
> > > kernel will abort the connection. However, that only applies to
> > > sockets that have been closed by the client. That is NOT the case
> > > here: NFS does not close the connection, and it ignores the kernel's
> > > notification of the FIN_WAIT2 state.
> >
> > Interesting. I had the same reproducer, but eventually an RST would
> > get sent from the NFS client due to the TCP keepalives. It sounds
> > like that is not the case anymore. My testing was over a year and a
> > half ago, though.
>
> After the keepalives TCP also sends a FIN/ACK; if the server properly
> sent a FIN/ACK back, then keepalive will indeed be sufficient. Can you
> check whether in your trace the server in one case sent just an ACK
> but in the other case sent the FIN/ACK?
>

I had two reproducers actually. In both cases the NFS server would never
send a FIN.

The first bug (https://bugzilla.redhat.com/show_bug.cgi?id=1458421) was
about an NFS server crash after being in the half-close state. An NFSv4
client without that keepalive patch would hang. This is a valid test case
and we check for it now in all releases. Here's the outline:

# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS4 server export with "vers=4,timeo=5"
# Step 3. On NFS server, block outgoing FIN packets from the NFS server (see the sketch after this outline)
# Step 4. On NFS server, use a systemtap script to delay the NFS4OPEN_CONFIRM response for a few seconds
# Step 5. On NFS client, open a file, which should get delayed and trigger a FIN from the NFS client
# Step 6. On NFS server, after a short delay, crash the server by 'echo c > /proc/sysrq-trigger'
# Step 7. On NFS client, wait for the NFS server to reboot
# Step 8. On NFS client, wait for NFS to reconnect the TCP connection
# Step 9. On NFS client, wait for the NFS4 grace period
# Step 10. On NFS client, try a 'ls' of the file created earlier
# Step 11. On NFS client, sleep for $DELAY seconds to allow all operations to complete
# Step 12. On NFS client, ensure all operations have completed
# Step 13. Cleanup
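For step 3, blocking the server's outgoing FINs can be done with something
like the following (a sketch only, assuming nfsd on port 2049 and iptables
on the server; the actual test script may do it differently):

# On the NFS server: drop any outgoing TCP segment from port 2049 that has
# the FIN flag set, so the client never sees the server's FIN.
iptables -A OUTPUT -p tcp --sport 2049 --tcp-flags FIN FIN -j DROP

# Remove the rule again during cleanup (step 13).
iptables -D OUTPUT -p tcp --sport 2049 --tcp-flags FIN FIN -j DROP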
The second test case (after the keepalive patch) was arguably an invalid
test, as I used systemtap on the NFS server to never close the socket so
it would never send a FIN. This obviously should never happen, but at the
time I wanted to see if the NFS client would recover.

I did that with this:

# Clear XPT_CLOSE (bit 2 of xpt_flags) on every transport that gets
# enqueued, so the server never closes the socket and never sends a FIN.
probe module("sunrpc").function("svc_xprt_enqueue")
{
        printf("%s: %s removing XPT_CLOSE from xprt = %p\n",
               tz_ctime(gettimeofday_s()), ppfunc(), $xprt);
        $xprt->xpt_flags = $xprt->xpt_flags & ~(1 << 2);
}

Here's the full outline for this test:

# This test checks automatic NFS TCP recovery after a specific failure scenario.
# The failure scenario is an NFSv3 mount that goes idle, so the NFS client closes the
# TCP connection, and the final FIN from the NFS server gets lost.
# This can be a tricky scenario since the NFS client's TCP may get stuck in FIN-WAIT-2 indefinitely.
# The NFS client should be able to recover the NFS TCP connection automatically without an unmount/mount
# cycle or reboot, and the TCP connection should not be stuck in FIN-WAIT-2 or any other non-connected
# state.
#
# Outline
# Step 1. On NFS client: tcpdump -U -q -i $INTERFACE -w $PCAP_FILE -s 0 port 2049 and host $NFS_SERVER
# Step 2. On NFS client, mount NFS3 server export
# Step 3. On NFS server, using the systemtap script above, force the NFS server to never send a FIN (avoid close)
# Step 4. On NFS client, verify the NFS mount with 'ls' and the TCP connection in TCP_ESTABLISHED (see the 'ss' sketch after this outline)
# Step 5. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to transition to TIME_WAIT state
# Step 6. On NFS client, wait up to $time_limit seconds to allow the NFS TCP connection to disappear
# Step 7. Cleanup
#
# PASS: The state of NFS's TCP connection is usable by the end of the test (commands don't hang, etc)
# FAIL: The state of the NFS TCP connection remains in FIN-WAIT-2 indefinitely
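Steps 4-6 and the PASS/FAIL check basically come down to watching the
client's socket state, e.g. something like this (a sketch only; the test
script's actual checks may differ):

# Step 4: right after the mount and 'ls', the connection should be established
ss -tn state established '( dport = :2049 )'

# Steps 5/6 and the FAIL case: the socket must not sit here forever once the
# mount goes idle and the client closes its end of the connection
ss -tn state fin-wait-2 '( dport = :2049 )'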
> > > One can argue that this is a broken server and we shouldn't
> > > bother. But this patch is an attempt to argue that the client
> > > still should care and deal with this condition. However, if the
> > > community feels that a broken server is a broken server and this
> > > form of attack is not interesting, this patch can live in the
> > > archive for later or never.
> >
> > This isn't IPoIB is it?
>
> No, just normal TCP.
>
> > Actually, fwiw, looking back I had speculated on changes in this
> > area. I'm adding you to the CC list of this bug which had some of
> > my musings on it:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1466802#c43
> > That bug I ended up closing when we could no longer prove there was
> > any state where the NFS client could get stuck in FIN_WAIT2 after
> > the keepalive patch.
>
> I can reproduce the current behavior with the current upstream code.
>
> > It can happen that the server only sends the ACK back to the
> > client's FIN,ACK, so in general the testcase is valid. But then the
> > question is how long should one wait for the final data and FIN
> > from the server, or are there ever instances where you shouldn't
> > wait forever? Is there a way for us to know for sure there is no
> > data left to receive so it's safe to timeout? No RPCs outstanding?
>
> Yep, all good questions which I don't have answers to. I realize, for
> instance, that a server might send an ACK and then send a FIN/ACK
> after that. But why is it bad for the client to proactively send an
> RST (by shutting down the connection in TCP_FIN_WAIT2) if the client
> was shutting down the connection anyway?
>

I think you could lose data from the server if you RST and don't wait,
but other than that I don't know. We may be splitting hairs here though
if there's no demonstrable harm to the application (NFS).

As Trond points out elsewhere, reconnect / port reuse may be an issue.
When I looked at this in the second bug I was almost convinced that
there was a problem with the 'close' method in xs_tcp_ops - that we
needed to be calling xs_close() - but I ran into some problems and I
wasn't sure about the test case.

> > I don't claim to know many of the subtleties here, as far as
> > whether the server would wait forever in LAST_ACK or
> > implementations eventually time out after some time. Seems like if
> > these last packets get lost there is likely a bug somewhere (either
> > firewall or TCP stack, etc).
> > https://tools.ietf.org/html/rfc793#page-22
> >
> > It looks like at least some people are putting timeouts into their
> > stacks, though I'm not sure whether that's a good idea or not.
>
> I saw only one other driver in the kernel that does have handling of
> the TCP_FIN_WAIT2 state (drivers/crypto/chelsio) ...

One final thought - is it possible that we could have a network that
does not support TCP keepalives? In that instance, I'll bet we could get
an indefinite hang in FIN_WAIT2 on my first test case (server crashed
after the half-close state). It does not look like TCP keepalives have
to be implemented, and maybe some routers / NATs / firewalls disallow
them (due to the extra traffic)?
https://tools.ietf.org/html/rfc1122#page-101
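For reference, these are the stock client-side knobs in play here (just the
system-wide defaults as a sketch of what to look at, not a claim about what
sunrpc actually uses, since a socket can override the keepalive settings
with the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT setsockopts):

# Keepalive tunables behind the probes (and eventual RST) I saw in my
# earlier reproducer; per-socket setsockopts can override these defaults.
sysctl net.ipv4.tcp_keepalive_time     # seconds of idle before the first probe (default 7200)
sysctl net.ipv4.tcp_keepalive_intvl    # seconds between probes (default 75)
sysctl net.ipv4.tcp_keepalive_probes   # unanswered probes before the connection is dropped (default 9)

# Bounds FIN-WAIT-2 only for orphaned sockets (ones the application has
# already closed), which as Olga notes is not the case for the xprt socket.
sysctl net.ipv4.tcp_fin_timeout        # default 60 seconds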