On Mon, 2012-10-22 at 14:03 -0400, Chris Perl wrote:
> I believe I may have identified a race condition with the client side
> RPC code that comes into play when handling a reconnect.
>
> First, I'll detail how to reproduce it, then give some details on what I
> think is going on.
>
>
> TO REPRODUCE:
>
> Client is 192.168.1.100
> Server is 192.168.1.200
>
> client# showmount -e server
> Export list for server:
> /export/test 192.168.1.100
> client# mkdir -p /mnt/test
> client# mount -t nfs -o \
> ro,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys \
> 192.168.1.200:/export/test /mnt/test
>
> Note: I have set all the ac* variables to 0 to better be able to trace
> the calls my user space program is making to things that go out over the
> wire. This is not crucial, as the problem still exists with these set to
> the default.
>
> Now, on the client, start the attached python program [1] with real time
> priority (SCHED_RR), stat'ing a file on the NFS mount:
>
> client# touch /mnt/test/file.txt
> client# chrt 50 ./hang.py /mnt/test/file.txt 10
>
> From here, it is necessary to simulate a network event such that the
> client is unable to talk to the server, but the server is still able to
> talk to the client.
>
> client# ip route add blackhole 192.168.1.200/32 && ip route flush cache
>
> Now, we wait until the userspace program makes its next stat call. Once
> it does, restart the NFS server processes on the server:
>
> server# /etc/init.d/nfs restart
>
> The idea of restarting the nfs server processes is to ensure that FINs
> and/or an RST come across from the server to the client, changing the
> state of its socket while it's waiting for the response to its call of
> `call_transmit' from within `__rpc_execute'.
>
> Finally, back on the client, remove the blackhole route:
>
> client# ip route del 192.168.1.200/32 && ip route flush cache
>
> Once the route is removed, you should be able to do things like ping the
> server again, run `rpcinfo -p' against it, etc. However, any NFS
> operations for this server will continue to hang. The way to recover is
> to kill the SCHED_RR process. After that, things will recover.
>
>
> NOTES:
>
> This setup for reproducing the problem may seem a little silly, but we
> had something very similar happen in our production environment a
> couple of weeks ago and it caused us to get into this situation. Note
> that it has nothing to do with the server side of things. I've seen
> this happen with a linux NFS server (as above) and with a NAS from a
> large storage vendor.
>
> If you run the stat'ing process as a normal process, i.e. SCHED_OTHER,
> then everything recovers without intervention (i.e. it all works as it
> should). However, I believe the same race condition is still there, but
> just doesn't manifest because in this case the stat'ing process does not
> have the same scheduling advantage over rpciod.
>
> We experienced this problem on the following CentOS kernel:
> 2.6.32-279.5.2.el6.centos.plus.x86_64, but I've verified that the same
> behavior exists as late as 3.7-rc2.
>
> While it is hanging, if you watch the network, you can see that the
> client continually completes a TCP 3-way handshake, but then immediately
> tears the connection down.
>
>
> MY ATTEMPTED ANALYSIS:
>
> First, I am no expert and just started looking at the rpc code in the
> kernel about 2 weeks ago. Therefore some or all of what I lay out here
> may be wrong, misguided, etc.
> This is just what I've been able to
> figure out by looking at the source code and writing a bunch of
> SystemTap scripts.
>
> I can get more detailed if necessary (including sharing the SystemTap
> scripts I used to observe this behavior), but don't want to make this
> email overly long, so will just touch on the high points:
>
> - The sleeping process that is in the middle of waiting for a
>   response to `call_transmit' is woken up via `xs_error_report' when
>   the FINs and/or RST come across the wire.
>
> - It finds the `tk_status' field of the task set to EAGAIN, and so
>   retries `call_transmit'.
>
> - On the next attempt, `tcp_sendmsg' and its call of
>   `sk_stream_wait_connect' see the socket has an error of ECONNRESET.
>   Therefore, the next call to `call_status' sees this and sets the
>   next `tk_action' to be `call_bind'.
>
> - `call_bind' does its thing and then we come to `call_connect'.
>
> - `call_connect' schedules the execution of `xs_tcp_setup_socket' with
>   rpciod via rpciod's workqueue and then goes to sleep.
>
> - Here is where I believe the problem comes about. Our sleeping
>   thread waiting for the connect to finish is woken up by rpciod
>   early via the following call stack:
>
>   0xffffffffa042f840 : rpc_make_runnable+0x0/0x80 [sunrpc]
>   0xffffffffa042fb5e : rpc_wake_up_task_queue_locked+0x18e/0x270 [sunrpc]
>   0xffffffffa042fc8e : rpc_wake_up_status+0x4e/0x80 [sunrpc]
>   0xffffffffa04288ec : xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
>   0xffffffffa042b9fd : xs_error_report+0x4d/0x90 [sunrpc]
>   0xffffffff81487436 : tcp_disconnect+0x356/0x410 [kernel]
>   0xffffffff814a9c20 : inet_stream_connect+0x50/0x2c0 [kernel]
>   0xffffffff8142bb20 : kernel_connect+0x10/0x20 [kernel]
>   0xffffffffa042b368 : xs_tcp_setup_socket+0xe8/0x4c0 [sunrpc]
>   0xffffffff8108c760 : worker_thread+0x170/0x2a0 [kernel]
>   0xffffffff81091d66 : kthread+0x96/0xa0 [kernel]
>   0xffffffff8100c14a : child_rip+0xa/0x20 [kernel]
>
>   It appears that with rpciod trying to reconnect, it winds up
>   indirectly invoking `xs_error_report' via a callback from the
>   sock/tcp layer.
>
> - At this point the `tk_status' is set to EAGAIN, so if
>   `call_connect_status' handled EAGAIN by setting `tk_action' to
>   `call_connect' again AND rpciod was finished trying to execute
>   `xs_tcp_setup_socket', then we would be ok.
>
> - However, `call_connect_status' sets `tk_action' to `call_transmit'
>   and goes on.
>
> - At this point rpciod is still executing, trying to connect the
>   socket, and eventually will succeed, but now our thread and rpciod
>   are racing to see who can update/check the state of the socket
>   first.
>
>   Depending on who wins and what status is returned,
>   `xs_tcp_send_request' will either tear the connection down and
>   retry, or it will simply retry without the teardown. We can wind up
>   in a loop of setups/teardowns and are never able to actually return
>   from `__rpc_execute'.
>
>   Since we're stuck in `__rpc_execute' and the transport is locked, no
>   other NFS operations can come in, and they all hang.
>
> Simply changing `call_connect_status' to handle EAGAIN as described
> above did "fix" my problem and I could no longer reproduce the issue.
> However, I believe that could result in another race condition with
> multiple rpciods trying to execute `xs_tcp_setup_socket' at the same
> time.
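For concreteness, a rough sketch of the `call_connect_status' change described just above might look like the following. This is illustrative only, not a tested patch; the surrounding error handling in net/sunrpc/clnt.c is abbreviated and approximate.

/* Sketch: have call_connect_status() retry the connect step on EAGAIN
 * instead of advancing to call_transmit while rpciod may still be
 * running xs_tcp_setup_socket(). Surrounding cases abbreviated. */
static void call_connect_status(struct rpc_task *task)
{
        int status = task->tk_status;

        task->tk_status = 0;
        switch (status) {
        case -EAGAIN:
                /* The connect was interrupted (e.g. by the
                 * xs_error_report() wakeup described above); go back
                 * to call_connect and wait for rpciod to finish. */
                task->tk_action = call_connect;
                return;
        case 0:
                task->tk_action = call_transmit;
                return;
        default:
                /* existing error handling (timeouts, rpc_exit(), ...) */
                rpc_exit(task, status);
        }
}

The point is simply that on EAGAIN the task would go back through `call_connect' and sleep until rpciod's connect attempt has actually completed, rather than racing it into `call_transmit'.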
> I think the "right" fix, at least as far as I can tell, would be to
> ensure that rpciod does not wake up the process sleeping on
> `call_connect' until it is completely done with its work (whether that
> is success or failure). I believe this, coupled with making
> `call_connect_status' handle EAGAIN, would work.
>
> Unfortunately, I'm stretching my abilities here and do not really have a
> good understanding of how the interactions come together to be able to
> submit a patch to fix this.
>
>
> WORKAROUND:
>
> Based on what I found above, I decided to try setting the real time
> priority of all my rpciod threads to one higher than that of my real
> time stat'ing process (i.e. 51). When I did this, I could no longer
> reproduce the problem. Setting the rt priority to 49, however, did allow
> it to still happen. I believe this is essentially the same reason that
> things seem to work normally for SCHED_OTHER processes: simply that
> they do not have the same scheduling advantage over rpciod, and therefore
> it's likely that rpciod wins and our thread finds the socket state as
> either `TCP_SYN_SENT' or `TCP_ESTABLISHED', and everything recovers.

Hi Chris,

Thanks for helping to debug this, and for providing such a thorough
analysis! This is great work!

Does adding a check in xs_error_report() for xprt_connecting(xprt), so
that we skip the call to xprt_wake_pending_tasks() if true, help to
resolve the race?

Cheers
  Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
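For reference, the check being suggested might look roughly like this in `xs_error_report' in net/sunrpc/xprtsock.c. This is only a sketch under the assumption that the socket error callback can simply skip the wakeup while a connect is in flight; whether it fully resolves the race is exactly the open question above.

/* Sketch (approximate, details abbreviated): skip waking pending tasks
 * from the error callback while a connect is already in progress, so
 * the connect worker can finish and report its own status. */
static void xs_error_report(struct sock *sk)
{
        struct rpc_xprt *xprt;

        read_lock_bh(&sk->sk_callback_lock);
        xprt = xprt_from_sock(sk);
        if (xprt == NULL)
                goto out;
        if (xprt_connecting(xprt))
                /* rpciod is in xs_tcp_setup_socket(); skip the wakeup */
                goto out;
        xprt_wake_pending_tasks(xprt, -EAGAIN);
out:
        read_unlock_bh(&sk->sk_callback_lock);
}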