On Jun 4, 2015, at 6:14 PM, Guillaume Morin <guillaume@xxxxxxxxxxx> wrote:

> On 04 Jun 17:23, Chuck Lever wrote:
>>> This just happened during a kernel panic of our nfs server which stayed
>>> down for a while, then only a dozen machines could not recover, the rest
>>> were fine. So it is definitely not that easy to trigger.
>>>
>>> So far all my attempts to reproduce this have failed. I tried mostly by
>>> setting iptables to send RSTs back to the server randomly and dropping
>>> SYNs pretty often. If you have any suggestions, that'd be great
>>
>> Is there a workload running on that mount point? It probably shouldn't
>> be idle when you try your experiment.
>
> I was just running ls -l on some dir. I could do some writes with dd.
> The mount point that "froze" was lightly used and mostly for writes
>
>>> Do you have any thoughts about my impression that there could be a race
>>> in cancelling the callback in xs_close() that could leave
>>> XPRT_CONNECTING on?
>>
>> I agree that XPRT_CONNECTING is probably the source of the issue.
>>
>> But xs_tcp_close() can be called directly by autoclose (not likely if
>> there are pending RPCs) or transport shutdown (also not likely, same
>> reason). I'm skeptical there's a race involving xs_close().
>>
>> I'm wondering if there was a missing state change upcall, or the state
>> change upcall happened and xs_tcp_cancel_linger_timeout() exited
>> without clearing XPRT_CONNECTING.
>
> I am 100% sure that XPRT_CONNECTING is the issue because 1) the state
> had the flag up 2) there was absolutely no nfs network traffic between the
> client and the server 3) I "unfroze" the mounts by clearing it manually.
>
> xs_tcp_cancel_linger_timeout, I think, is guaranteed to clear the flag.

I’m speculating based on some comments in the git log, but what if the
transport never sees TCP_CLOSE, but rather gets an error_report callback
instead?

> Either the callback is canceled and it clears the flag or the callback
> will do it. I am not sure how this could leave the flag set but I am
> not familiar with this code, so I could totally be missing something
> obvious.
>
> xs_tcp_close() is the only thing I have found which cancels the callback
> and does not clear the flag.

How would xs_tcp_close() be invoked?

>> It's rather academic, though. All this code was replaced in 4.0.
>
> Well, it's not academic for all the users of the stable branches which
> might have this bug in the kernel they're running :-)

I didn’t mean to be glib. The point is, stable kernels are always fixed
by backporting an existing fix from a newer kernel.

> If I can reproduce this issue, I will happily test 4.0.

Thanks!

--
Chuck Lever
chucklever@xxxxxxxxx