On Wed, 2018-12-12 at 18:02 +0000, Trond Myklebust wrote: > On Wed, 2018-12-12 at 12:47 -0500, Dave Wysochanski wrote: > > On Wed, 2018-12-12 at 16:56 +0000, Trond Myklebust wrote: > > > On Wed, 2018-12-12 at 08:51 -0500, Dave Wysochanski wrote: > > > > Commit 9b30889c548a changed the handling of TCP_CLOSE inside > > > > xs_tcp_state_change. Prior to this change, the XPRT_CONNECTED > > > > bit > > > > was cleared unconditionally inside xprt_disconnect_done, > > > > similar > > > > to the handling of TCP_CLOSE_WAIT. After the change the > > > > clearing > > > > of XPRT_CONNECTED depends on successfully queueing a work based > > > > xprt_autoclose which depends on XPRT_LOCKED and may not happen. > > > > This is significant in the case of an unexpected RST from the > > > > server, as the client will only see xs_tcp_state_change called > > > > with > > > > sk_state == TCP_CLOSE. Restore the unconditional clear_bit on > > > > XPRT_CONNECTED while handling TCP_CLOSE and make it consistent > > > > with handling TCP_CLOSE_WAIT. > > > > > > > > Signed-off-by: Dave Wysochanski <dwysocha@xxxxxxxxxx> > > > > --- > > > > net/sunrpc/xprtsock.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c > > > > index 8a5e823e0b33..b9789036051d 100644 > > > > --- a/net/sunrpc/xprtsock.c > > > > +++ b/net/sunrpc/xprtsock.c > > > > @@ -1492,6 +1492,7 @@ static void xs_tcp_state_change(struct > > > > sock > > > > *sk) > > > > if (sk->sk_err) > > > > xprt_wake_pending_tasks(xprt, -sk- > > > > > sk_err); > > > > > > > > /* Trigger the socket release */ > > > > + clear_bit(XPRT_CONNECTED, &xprt->state); > > > > xs_tcp_force_close(xprt); > > > > } > > > > out: > > > > > > Hi Dave, > > > > > > This isn't needed for 4.20 or newer because call_transmit() will > > > now > > > always call xprt_end_transmit(). I suggest that a stable fix do > > > something similar (perhaps conditional on the error returned by > > > xprt_transmit()?). > > > > > > > Can you explain the handling in xs_tcp_state_change - why > > XPRT_CONNECTED would need to remain set longer in the case of > > handling > > TCP_CLOSE case rather than TCP_CLOSE_WAIT? It seems like if we get > > an > > RST we'll go directly to TCP_CLOSE and why would the XPRT_CONNECTED > > bit > > getting cleared need to be delayed in that case? > > > > I will look into the approach you mention though at the moment I do > > not > > see how it helps because the underlying issue seems to be clearing > > of > > the XPRT_CONNECTED bit. > > > > See xprt_clear_locked(). Dropping the XPRT_LOCK will automatically > trigger autoclose if XPRT_CLOSE_WAIT is set, regardless of whether > or > not there are other tasks waiting for the lock. > > Ok thanks for pointing me in that direction I will investigate. I am actually seeing a hang now with the reproducer on 4.20-rc6 but it's not CPU spinning. Investigating as it was not easy to reproduce but there must still be a race somewhere.