On Tue, 2013-10-29 at 17:42 +-1100, NeilBrown wrote: +AD4- We have a customer who hit a rare race in sunrpc (in a 3.0 based kernel, +AD4- but the relevant code doesn't seem to have changed much). +AD4- +AD4- The thread that crashed was in +AD4- xs+AF8-tcp+AF8-setup+AF8-socket -+AD4- inet+AF8-stream+AF8-connect -+AD4- lock+AF8-sock+AF8-nested. +AD4- +AD4- 'sock' in this last function is NULL. +AD4- +AD4- The only way I can imagine this happening is if some other thread called +AD4- +AD4- xs+AF8-close -+AD4- xs+AF8-reset+AF8-transport -+AD4- sock+AF8-release -+AD4- inet+AF8-release +AD4- +AD4- in a very small window a moment earlier. +AD4- +AD4- As far as I can tell, xs+AF8-close is only called with XPRT+AF8-LOCKED set. +AD4- +AD4- xs+AF8-tcp+AF8-setup+AF8-socket is mostly scheduled with XPRT+AF8-LOCKED set to which would +AD4- exclude them from running at the same time. +AD4- +AD4- +AD4- However xs+AF8-tcp+AF8-schedule+AF8-linger+AF8-timeout can schedule the thread which runs +AD4- xs+AF8-tcp+AF8-setup+AF8-socket without first claiming XPRT+AF8-LOCKED. +AD4- So I assume that is what is happening. +AD4- +AD4- I imagine some race between the client closing the socket, and getting +AD4- TCP+AF8-FIN+AF8-WAIT1 from the server and somehow the two threads racing. +AD4- +AD4- I wonder if it might make sense to always abort 'connect+AF8-worker' in +AD4- xs+AF8-close()? +AD4- I think the connect+AF8-worker really mustn't be running or queued at this point, +AD4- so cancelling it is either a no-op, or vitally important. +AD4- +AD4- So: does the following patch seem reasonable? If so I'll submit it properly +AD4- with a coherent description etc. Hi Neil, Will that do the right thing if the connect+AF8-worker and close are running on the same rpciod thread? I think it should, but I never manage to keep 100+ACU- up to date with the ever changing semantics of cancel+AF8-delayed+AF8-work+AF8-sync() and friends... Cheers, Trond -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust+AEA-netapp.com www.netapp.com -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html