Re: CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

Jeff Layton <jlayton@xxxxxxxxxx> · Mon, 24 Dec 2012 09:14:21 -0500

On Sun, 23 Dec 2012 09:10:34 -0500
Jeff Layton <jlayton@xxxxxxxxxx> wrote:

> On Thu, 20 Dec 2012 09:38:06 -0500
> Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> > On Wed, 19 Dec 2012 11:30:32 -0800 (PST)
> > Tim Perry <tim.perry@xxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > 
> > > Dear Jeff, et al.,
> > > 
> > > 
> > > I can reproduce the problem by starting "find . -name \*.ext" and killing it when connected to either of our two Windows 2003 Servers. I can *not* reproduce it doing the same thing connected to a Windows 7 box.
> > > 
> > > $ uname -a
> > > Linux servername 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:49:02 UTC 2012 i686 i686 i386 GNU/Linux
> > > $ cat /proc/version
> > > Linux version 3.2.0-34-generic (buildd@roseapple) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #53-Ubuntu SMP Thu Nov 15 10:49:02 UTC 2012
> > > $ lsb_release -a
> > > No LSB modules are available.
> > > Distributor ID: Ubuntu
> > > Description:    Ubuntu 12.04.1 LTS
> > > Release:        12.04
> > > Codename:       precise
> > > 
> > > 
> > > I tried using strace but hitting ctrl-c killed strace (obviously, oops), but interestingly, this did *not* hang the file system. I will try and kill the find command (kill -9 perhaps?) and see if I can recreate the error that way.
> > > 
> > > CONTINUING HERE:
> > > I don't think strace on the find command will help because it isn't making the network connections. CIFS is making the network connections. Maybe I can cause the mount to happen with an strace version of CIFS?  How would I do that?
> > > 
> > > Anyhow, I opened two terminal windows and proceeded as follows:
> > > 
> > > In terminal 1:
> > > 
> > > $ strace find . -name \*adzzz >& ~/straceFind.txt
> > > 
> > > 
> > > In terminal 2:
> > > $ ps aux | grep find | grep -v strace
> > > perry     2583 12.6  0.0   4792  1088 pts/5    R+   11:27   0:00 find . -name *adzzz
> > > perry     2585  0.0  0.0   4388   828 pts/2    S+   11:27   0:00 grep find
> > > $ kill -9 2583
> > > 
> > > File system dies.
> > > 
> > > I've attached the straceFind.txt, but it just shows find walking the filesystem tree:
> > > fstatat64(AT_FDCWD, "0010", {st_mode=S_IFDIR|0777, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
> > > openat(AT_FDCWD, "0010", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5
> > > fchdir(5)                               = 0
> > > getdents64(5, /* 14 entries */, 32768)  = 448
> > > getdents64(5, /* 0 entries */, 32768)   = 0
> > > close(5)                                = 0
> > > fstatat64(AT_FDCWD, "_vti_cnf", {st_mode=S_IFDIR|0777, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
> > > openat(AT_FDCWD, "_vti_cnf", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5
> > > fchdir(5)                               = 0
> > > getdents64(5, /* 13 entries */, 32768)  = 416
> > > getdents64(5, /* 0 entries */, 32768)   = 0
> > > close(5)                                = 0
> > > open("..", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW) = 5
> > > fstat64(5, {st_mode=S_IFDIR|0777, st_size=0, ...}) = 0
> > > fchdir(
> > > 
> > > 
> > > Ideas?
> > > 
> > 
> > That kernel is pretty old, so you may want to try a more recent one.
> > 
> > You may first want to start by tracing with wireshark -- see what's
> > happening on the wire before and after the signal is delivered.
> > 
> > If it works against win7 then it's likely that win7 disconnects the
> > socket when the signatures are wrong. With that, we'd reestablish the
> > connection and things would start working again. I suspect that win2k8
> > just starts returning an error that we map to -EACCES.
> > 
> > It's possible that we should disconnect the client when the signatures
> > start looking wrong, but I think we need to understand why signals are
> > causing this issue in the first place.
> > 
> > There are some places where we do interruptible sleeps (vs. killable
> > ones). It's possible that SIGINT (which is what ^c generally delivers)
> > is causing havoc there.
> > 
> 
> I had a look at the code today and suspect that I know what the problem
> is. When the kernel goes to send a request, it first signs it and then
> bumps the sequence numbers that it tracks. If the request doesn't
> actually make it out onto the wire, like when the task catches a
> signal, those sequence numbers remain high even though the request
> didn't go out.
> 
> Here's an untested patch that might help tell whether this is the
> case. You may want to try it and see if it does. Note that this fix is
> a bit of a kludge and is not suitable for merging!
> 
> A better fix would involve changing when the sequence number gets
> bumped in the first place. If this patch seems to help things, then
> I'll look at coding that up.
> 
> diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
> index 76d974c..4520234 100644
> --- a/fs/cifs/transport.c
> +++ b/fs/cifs/transport.c
> @@ -334,10 +334,14 @@ uncork:
>  		server->tcpStatus = CifsNeedReconnect;
>  	}
>  
> -	if (rc < 0 && rc != -EINTR)
> -		cERROR(1, "Error %d sending data on socket to server", rc);
> -	else
> +	if (rc < 0) {
> +		if (rc == -EINTR)
> +			server->sequence_number -= 2;
> +		else
> +			cERROR(1, "Error %d sending data on socket to server", rc);
> +	} else {
>  		rc = 0;
> +	}
>  
>  	return rc;
>  }
> 
> 

I was able to reproduce this, and I don't think the above patch will
fix it (at least not completely). The problem seems to be that the NT
cancel command is screwing up the sequence numbers. We'll have to do
some research to figure out why that's occurring.
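
For anyone following along, here is a rough userspace sketch of the
failure mode I described in my earlier mail: the client signs a request
and bumps its counter, the send is interrupted, and from then on the
server sees sequence numbers it doesn't expect. This is not the fs/cifs
code. The struct and function names are made up for illustration, and it
assumes the request and the response each consume one sequence number
(which is what the "-= 2" in the test patch implies). It does not model
the NT cancel behavior at all, which is the part that still needs
research.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one side of a signed session (not kernel code). */
struct peer {
	uint32_t sequence_number;	/* next sequence number this side will use */
};

/*
 * Sign a request: the request takes the current number and the response is
 * expected at the next one, so the counter advances by 2 (assumption).
 * Returns the sequence number stamped on the request.
 */
static uint32_t client_sign(struct peer *client)
{
	uint32_t seq = client->sequence_number;

	client->sequence_number += 2;
	return seq;
}

/* The server only accepts a request whose number matches its own counter. */
static bool server_receive(struct peer *server, uint32_t seq)
{
	if (seq != server->sequence_number)
		return false;
	server->sequence_number += 2;
	return true;
}

/*
 * Sign first, then "send". send_rc stands in for the transport result:
 * 0 on success, or a negative errno such as -EINTR if a signal arrived
 * before anything reached the wire. With rollback set, the bump is undone
 * on -EINTR, which is roughly what the test patch does.
 *
 * Returns 0 if the server accepted the request, -EACCES if it rejected
 * the sequence number, or send_rc if the send itself failed.
 */
static int client_request(struct peer *client, struct peer *server,
			  int send_rc, bool rollback)
{
	uint32_t seq = client_sign(client);

	if (send_rc < 0) {
		if (send_rc == -EINTR && rollback) {
			assert(client->sequence_number >= 2);
			client->sequence_number -= 2;
		}
		return send_rc;
	}
	return server_receive(server, seq) ? 0 : -EACCES;
}
```

With rollback off, one interrupted call followed by a normal one leaves
the client two ahead of the server, and the normal call comes back as
-EACCES. With rollback on, the two sides stay in step in this model.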

-- 
Jeff Layton <jlayton@xxxxxxxxxx>