On Tue, May 31, 2011 at 3:44 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> On Tue, 31 May 2011 12:45:37 -0700
> Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:
>
>> On 05/31/2011 12:36 PM, Steve French wrote:
>> > This is on setting up a session, so could be something like:
>> > - mount
>> > - do write
>> > - server crash
>> > - attempt to reconnect
>> > - socket returns ENOTSOCK
>> > - attempt to reconnect ...
>> > - repeat
>> >
>> > Is this repeatable enough that we could modify the client to stop on
>> > the reconnect to see what is causing the socket to go bad and which
>> > operation we are repeating the reconnect on?
>>
>> Well, ENOTSOCK sounds like a pretty serious coding problem. Maybe
>> a use-after-close or something?
>>
>> At the least, we could look for some particular errors (such as ENOTSOCK)
>> and print more info and do a more thorough job of cleaning up.
>>
>> Maybe a WARN_ON_ONCE() when the rv is ENOTSOCK as well?
>>
>> Seems we can reproduce this only when our open-filer HA system
>> craps itself during failover, but we can get that to happen usually
>> within hours, sometimes maybe about a day. And, CIFS errors don't always
>> happen when the HA cluster goes bad.
>>
>> So, I'm happy to test patches, but since it's a bit tricky to
>> reproduce this...I'm hoping to get the best info possible with
>> each patch iteration!
>>
>
> I had a report of a similar problem on a RHEL5 (2.6.18) kernel:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=704921
>
> In this case, it caused an oops as well. Your problem may or may not be
> the same, but if it is, I suspect that the root cause is a lack of
> clear locking rules for the TCP_Server_Info->tcpStatus.
>
> What I think happened in that case was that the client was in the
> middle of a NEGOTIATE request and got a response, and another reconnect
> occurred while it was processing it.
> While the client was tearing down
> and creating a new socket, the thread that issued the NEGOTIATE on the
> previous socket marked the tcpStatus as CifsGood.
>
> Fixing it looks to be anything but trivial. I'm not even quite sure how
> to approach it at this point. Suggestions welcome.

I thought the kernel was more recent than that - how recent is the kernel here?

I think this could be related to cifs_sendv returning ENOTSOCK
immediately when not reconnected.

--
Thanks,

Steve
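
For anyone following the thread, the interleaving Jeff describes can be
replayed in a small userspace sketch. Everything in it is hypothetical:
struct sketch_server, the SKETCH_* states and the generation counter are
illustration only, not the real TCP_Server_Info or the cifs code, and
nothing in the thread says a generation check is how this should be fixed.
It only shows why a NEGOTIATE completion that ignores which socket it was
sent on can leave the status wrong after a reconnect.

```c
/* Hypothetical states, loosely modelled on the tcpStatus values named
 * in the thread. These are not the kernel's enum. */
enum sketch_status {
    SKETCH_NEED_RECONNECT,
    SKETCH_NEED_NEGOTIATE,
    SKETCH_GOOD
};

/* Hypothetical stand-in for the per-server state. */
struct sketch_server {
    enum sketch_status status;
    unsigned int generation;   /* bumped every time the socket is replaced */
};

/* Reconnect path: the old socket is torn down and a new one created,
 * so the new connection still has to negotiate. */
static void sketch_reconnect(struct sketch_server *srv)
{
    srv->generation++;
    srv->status = SKETCH_NEED_NEGOTIATE;
}

/* Unguarded completion: marks the connection good no matter which
 * socket the NEGOTIATE was sent on. This is the suspected problem. */
static void sketch_negotiate_done_unguarded(struct sketch_server *srv)
{
    srv->status = SKETCH_GOOD;
}

/* Guarded completion: accepts the response only if the socket has not
 * been replaced since the request was sent. Returns 1 if accepted,
 * 0 if the response is stale. In a multi-threaded client the check and
 * the update would both have to happen under one lock; this
 * single-threaded sketch only replays the ordering of events. */
static int sketch_negotiate_done_guarded(struct sketch_server *srv,
                                         unsigned int sent_on)
{
    if (sent_on != srv->generation)
        return 0;
    srv->status = SKETCH_GOOD;
    return 1;
}
```

To replay the race: record srv.generation when the NEGOTIATE is sent, call
sketch_reconnect(), then run the completion. The unguarded version leaves
the status at SKETCH_GOOD even though the new socket has not negotiated;
the guarded version rejects the stale response and leaves the status at
SKETCH_NEED_NEGOTIATE.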