I believe I may have identified a race condition in the client-side RPC code that comes into play when handling a reconnect. First I'll detail how to reproduce it, then give some details on what I think is going on.

TO REPRODUCE:

Client is 192.168.1.100
Server is 192.168.1.200

client# showmount -e server
Export list for server:
/export/test 192.168.1.100

client# mkdir -p /mnt/test
client# mount -t nfs -o \
  ro,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys \
  192.168.1.200:/export/test /mnt/test

Note: I have set all the ac* options to 0 to be better able to match the calls my user space program makes against what goes out over the wire. This is not crucial; the problem still exists with them left at their defaults.

Now, on the client, start the attached python program [1] with real time priority (SCHED_RR), stat'ing a file on the NFS mount:

client# touch /mnt/test/file.txt
client# chrt 50 ./hang.py /mnt/test/file.txt 10

From here, it is necessary to simulate a network event such that the client is unable to talk to the server, but the server is still able to talk to the client:

client# ip route add blackhole 192.168.1.200/32 && ip route flush cache

Now we wait until the userspace program makes its next stat call. Once it does, restart the NFS server processes on the server:

server# /etc/init.d/nfs restart

The point of restarting the NFS server processes is to ensure that FINs and/or an RST come across from the server to the client, changing the state of the client's socket while it is waiting for the response to its `call_transmit' from within `__rpc_execute'.

Finally, back on the client, remove the blackhole route:

client# ip route del 192.168.1.200/32 && ip route flush cache

Once the route is removed, you should be able to do things like ping the server again, run `rpcinfo -p' against it, etc. However, any NFS operations against this server will continue to hang. The way to recover is to kill the SCHED_RR process; after that, things recover on their own.

NOTES:

This setup for reproducing the problem may seem a little silly, but we had something very similar happen in our production environment a couple of weeks ago, and it got us into this same situation.

Note that it has nothing to do with the server side of things. I've seen this happen with a Linux NFS server (as above) and with a NAS from a large storage vendor.

If you run the stat'ing process as a normal process, i.e. SCHED_OTHER, then everything recovers without intervention (i.e. it all works as it should). However, I believe the same race condition is still there; it just doesn't manifest because the stat'ing process then doesn't have the same scheduling advantage over rpciod.

We experienced this problem on the following CentOS kernel: 2.6.32-279.5.2.el6.centos.plus.x86_64, but I've verified that the same behavior exists as late as 3.7-rc2.

While the mount is hanging, if you watch the network, you can see that the client continually completes a TCP 3-way handshake, but then immediately tears the connection down.

MY ATTEMPTED ANALYSIS:

First, I am no expert; I only started looking at the RPC code in the kernel about 2 weeks ago, so some or all of what I lay out here may be wrong or misguided. This is just what I've been able to figure out by reading the source code and writing a bunch of SystemTap scripts.

I can get more detailed if necessary (including sharing the SystemTap scripts I used to observe this behavior), but I don't want to make this email overly long, so I'll just touch on the high points:

- The sleeping process that is in the middle of waiting for a response to `call_transmit' is woken up via `xs_error_report' when the FINs and/or RST come across the wire.

- It finds the `tk_status' field of the task set to EAGAIN, and so retries `call_transmit'.

- On the next attempt, `tcp_sendmsg' and its call of `sk_stream_wait_connect' see that the socket has an error of ECONNRESET. The next call to `call_status' sees this and sets the next `tk_action' to `call_bind'.

- `call_bind' does its thing and then we come to `call_connect'.

- `call_connect' schedules the execution of `xs_tcp_setup_socket' on rpciod's workqueue and then goes to sleep.

- Here is where I believe the problem comes about. Our sleeping thread, waiting for the connect to finish, is woken up early by rpciod via the following call stack:

  0xffffffffa042f840 : rpc_make_runnable+0x0/0x80 [sunrpc]
  0xffffffffa042fb5e : rpc_wake_up_task_queue_locked+0x18e/0x270 [sunrpc]
  0xffffffffa042fc8e : rpc_wake_up_status+0x4e/0x80 [sunrpc]
  0xffffffffa04288ec : xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
  0xffffffffa042b9fd : xs_error_report+0x4d/0x90 [sunrpc]
  0xffffffff81487436 : tcp_disconnect+0x356/0x410 [kernel]
  0xffffffff814a9c20 : inet_stream_connect+0x50/0x2c0 [kernel]
  0xffffffff8142bb20 : kernel_connect+0x10/0x20 [kernel]
  0xffffffffa042b368 : xs_tcp_setup_socket+0xe8/0x4c0 [sunrpc]
  0xffffffff8108c760 : worker_thread+0x170/0x2a0 [kernel]
  0xffffffff81091d66 : kthread+0x96/0xa0 [kernel]
  0xffffffff8100c14a : child_rip+0xa/0x20 [kernel]

  It appears that, with rpciod trying to reconnect, it winds up indirectly invoking `xs_error_report' via a callback from the sock/tcp layer.

- At this point `tk_status' is set to EAGAIN, so if `call_connect_status' handled EAGAIN by setting `tk_action' back to `call_connect', AND rpciod were finished trying to execute `xs_tcp_setup_socket', then we would be ok.

- However, `call_connect_status' sets `tk_action' to `call_transmit' and moves on.

- At this point rpciod is still trying to connect the socket and will eventually succeed, but now our thread and rpciod are racing to see who can update/check the state of the socket first. Depending on who wins and what status is returned, `xs_tcp_send_request' will either tear the connection down and retry, or simply retry without the teardown. We can wind up in a loop of setups and teardowns and never actually return from `__rpc_execute'. Since we're stuck in `__rpc_execute' and the transport is locked, no other NFS operations can come in, and they all hang.

Simply changing `call_connect_status' to handle EAGAIN as described above did "fix" my problem and I could no longer reproduce the issue. However, I believe that could result in another race condition, with multiple rpciods trying to execute `xs_tcp_setup_socket' at the same time.

I think the "right" fix, at least as far as I can tell, would be to ensure that rpciod does not wake up the process sleeping on `call_connect' until it is completely done with its work (whether that ends in success or failure). I believe that, coupled with making `call_connect_status' handle EAGAIN, would work. Unfortunately, I'm stretching my abilities here and don't yet understand how all the pieces interact well enough to submit a patch to fix this.
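
For what it's worth, the setup/teardown loop described above is easy to watch from the client side while the mount is hung. A capture along these lines (the interface name here is just a guess for your setup, and 2049 is simply the standard NFS port) should show the repeated 3-way handshake followed by an immediate FIN/RST:

client# tcpdump -n -i eth0 host 192.168.1.200 and tcp port 2049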

WORKAROUND:

Based on what I found above, I decided to try setting the real time priority of all my rpciod threads to one higher than that of my real time stat'ing process (i.e. 51) [2]. When I did this, I could no longer reproduce the problem. Setting the rt priority to 49, however, still allowed it to happen.

I believe this is essentially the same reason things seem to work normally for SCHED_OTHER processes: they simply don't have the same scheduling advantage over rpciod, so it's likely that rpciod wins, our thread finds the socket state as either `TCP_SYN_SENT' or `TCP_ESTABLISHED', and everything recovers.

Chris

[1] hang.py

#!/usr/bin/python
import os
import sys
import time
import datetime

f = sys.argv[1]
s = float(sys.argv[2])

while True:
    r = os.stat(f)
    print "%s %r" % (datetime.datetime.now(), r)
    time.sleep(s)
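
[2] A rough sketch of how the rpciod priority bump can be applied, assuming dedicated per-CPU rpciod/N workqueue threads as on the 2.6.32 kernel mentioned above (newer kernels may run this work on generic kworker threads instead, in which case this won't apply as-is):

client# for pid in $(pgrep rpciod); do chrt -r -p 51 "$pid"; done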