On Wed, 19 Oct 2011 00:59:50 +0400 Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote: > 2011/10/19 Steve French <smfrench@xxxxxxxxx>: > > On Tue, Oct 18, 2011 at 3:29 PM, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote: > >> 2011/10/14 Jeff Layton <jlayton@xxxxxxxxxx>: > >>> On Fri, 14 Oct 2011 14:02:54 +0400 > >>> Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote: > >>> > >>>> Today, I caught it once again and didn't noticed any reconnects (no cERRORs). > >>>> > >>>> It is surely not depends on Jeff's async read patchset, because I used > >>>> my cifs-3.2-current branch. > >>>> > >>>> My branch consists of Steve's master + lockpatchset + smb2 patches. > >>>> From another hand, previously I caught it with Jeff's branch (without > >>>> lockpatchset and smb2 patches). So, that's why the problem is in > >>>> existing cifs code now. > >>>> > >>>> FYI: I checked two files: "buggy" and original, and noticed that the > >>>> difference between them is located in one place - positions from > >>>> 2014442 to 2014569 - 126 differences with two equal holes. > >>>> > >>>> So, 2014569 - 2014442 + 1 = 128 wrong bytes. Ideas? > >>>> > >>> > >>> Good to know, thanks. I also tried reproducing this for a while last > >>> night and was unable to... > >>> > >>> I used this script: > >>> > >>> -------------------------[snip]------------------------------ > >>> #!/bin/bash > >>> > >>> origfile=$1 > >>> destfile=$2 > >>> > >>> origsum=`md5sum $origfile | cut -d' ' -f1` > >>> i=0 > >>> > >>> while true; do > >>> echo $i > >>> rm -f $destfile $origfile.tmp > >>> > >>> dd if=$origfile of=$destfile bs=100000 > >>> if [ $? -ne 0 ]; then > >>> echo "dd1 failed" > >>> exit 1 > >>> fi > >>> > >>> dd if=$destfile of=$origfile.tmp bs=100000 > >>> if [ $? -ne 0 ]; then > >>> echo "dd2 failed" > >>> exit 1 > >>> fi > >>> > >>> destsum=`md5sum $destfile | cut -d' ' -f1` > >> > >> As you have already read $destfile to $origfile.tmp, there is no need > >> to read it again - you only need to calculate md5sum of the > >> origfile.tmp. > >> > >>> if [ "$origsum" != "$destsum" ]; then > >>> echo "md5sums don't match! orig=$origsum dest=$destsum" > >>> stat $origfile > >>> stat $destfile > >>> exit 1 > >>> fi > >>> > >>> i=`expr $i + 1` > >>> done > >>> > >>> -------------------------[snip]------------------------------ > >>> > >>> I ran the above with the first arg set to a ~615M .iso file on local > >>> disk and the second to a file on a cifs mount. > >>> > >>> I ran it against my win2k8 host for several hours and it never failed. > >>> I then tried running it against my Windows 7 home host (running on > >>> bare-metal) and it would run for a little while and would eventually > >>> fail due to the server returning "out of memory" errors. Some of those > >>> would occur on the NEGOTIATE call, so I chalk that up to a Win7 bug. > >>> > >>> I never saw this mismatch, but I think we can try to infer something > >>> from the nature of the failures that Pavel saw... > >>> > >>> Since the file was apparently being written properly, the write phase > >>> seems like it worked correctly. The data all went into the cache, and > >>> then got flushed properly to the server. > >>> > >>> So, it seems likely that the problem is in the read phase of the test. > >>> There are several possibilities: > >>> > >>> 1) we started out doing a cache read, but the cache was invalidated > >>> partway through. "Something happened" and one of the reads got mangled. > >>> > >>> 2) the server sent us a corrupt read for some reason > >>> > >>> 3) lower level networking problem caused a corrupt read > >>> > >>> 4) generic memory corruption in the pagecache of some sort > >>> > >>> ...plus many others... > >>> > >>> The fact that only 127 bytes was corrupt is very odd. It would be > >>> easier to understand if an entire page were bad, or an entire rsize > >>> chunk. > >>> > >>> If you are able to reproduce this again, it might be helpful to see if > >>> that's consistent. Try to nail down the nature of the corruption -- see > >>> how much is different and where the different parts are. That may > >>> help shed light on the problem... > >>> > >>> In any case, this will probably take some digging -- we should probably > >>> open a bug at bugzilla.samba.org and start working on this there. > >>> Pavel, would you mind doing that when you have time? > >>> > >>> Thanks, > >>> -- > >>> Jeff Layton <jlayton@xxxxxxxxxx> > >>> > >> > >> So, after a closer investigating of the problem I figured out that: > >> > >> 1) It always reproduces after I boot the OS, load module, mount share > >> and read the existing file. > >> > >> 2) Network traffics that are caught by wireshark on the server > >> (Windows 7) and the client are different - I checked it and found the > >> same difference in response packets for the area that is different on > >> orig and orig.tmp files (the response packet from the capture on the > >> server was true and the response packet from the capture on the client > >> was failed). > >> > >> 3) The different area is always 128 bytes bounded but appears in > >> different places. > >> > >> 4) It doesn't depends on a maybe broken LAN cable - I used two > >> different ones with the same results. > >> > >> So, I don't think that it's cifs module issue and there is no need to > >> open a bug on bugzilla.samba.org. It seems that it's the problem with > >> the network driver or with the LAN card from my laptop. > >> > >> Make sense? > > > > Yes ... but it brings up the obvious question ... what happens if cifs > > signing is turned on? > > > > I tried it - one "Unexpected SMB signature" message appears in dmesg. > > I found this code in cifs_check_receive: > 503 /* convert the length into a more usable form */ > 504 if (server->sec_mode & (SECMODE_SIGN_REQUIRED | > SECMODE_SIGN_ENABLED)) { > 505 struct kvec iov; > 506 > 507 iov.iov_base = mid->resp_buf; > 508 iov.iov_len = len; > 509 /* FIXME: add code to kill session */ > 510 if (cifs_verify_signature(&iov, 1, server, > 511 > mid->sequence_number + 1) != 0) > 512 cERROR(1, "Unexpected SMB signature"); > 513 } > > So, now we don't do anything with if such a thing happens. My be we > should follow the FIXME comment and kill the session? > Yes, it's been this way for a long time. MS-CIFS doesn't say that we have to abort the connection however... "If the signature received with the message does not match the signature calculated, the message MUST be discarded, and no further processing is done on it. The receiver MAY also terminate the connection by disconnecting the underlying transport connection and cleaning up any state associated with the connection." At the very least we should use this to return an error back to the caller and allow it to deal with that. Maybe then we could also abort the connection after a certain number of signature check failures? I'd hate to abort the socket just for one signature failure, especially when we have apparent hiccups like this in the lower level networking. -- Jeff Layton <jlayton@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-cifs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html