If the session could continue after an incorrect checksum (ie retry the operation that failed due to the network error/corruption) then there would be significant advantage in not taking the session down. In practice would that request time out and get killed and retried? On Tue, Oct 18, 2011 at 3:59 PM, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote: > 2011/10/19 Steve French <smfrench@xxxxxxxxx>: >> On Tue, Oct 18, 2011 at 3:29 PM, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote: >>> 2011/10/14 Jeff Layton <jlayton@xxxxxxxxxx>: >>>> On Fri, 14 Oct 2011 14:02:54 +0400 >>>> Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote: >>>> >>>>> Today, I caught it once again and didn't noticed any reconnects (no cERRORs). >>>>> >>>>> It is surely not depends on Jeff's async read patchset, because I used >>>>> my cifs-3.2-current branch. >>>>> >>>>> My branch consists of Steve's master + lockpatchset + smb2 patches. >>>>> From another hand, previously I caught it with Jeff's branch (without >>>>> lockpatchset and smb2 patches). So, that's why the problem is in >>>>> existing cifs code now. >>>>> >>>>> FYI: I checked two files: "buggy" and original, and noticed that the >>>>> difference between them is located in one place - positions from >>>>> 2014442 to 2014569 - 126 differences with two equal holes. >>>>> >>>>> So, 2014569 - 2014442 + 1 = 128 wrong bytes. Ideas? >>>>> >>>> >>>> Good to know, thanks. I also tried reproducing this for a while last >>>> night and was unable to... >>>> >>>> I used this script: >>>> >>>> -------------------------[snip]------------------------------ >>>> #!/bin/bash >>>> >>>> origfile=$1 >>>> destfile=$2 >>>> >>>> origsum=`md5sum $origfile | cut -d' ' -f1` >>>> i=0 >>>> >>>> while true; do >>>> echo $i >>>> rm -f $destfile $origfile.tmp >>>> >>>> dd if=$origfile of=$destfile bs=100000 >>>> if [ $? -ne 0 ]; then >>>> echo "dd1 failed" >>>> exit 1 >>>> fi >>>> >>>> dd if=$destfile of=$origfile.tmp bs=100000 >>>> if [ $? -ne 0 ]; then >>>> echo "dd2 failed" >>>> exit 1 >>>> fi >>>> >>>> destsum=`md5sum $destfile | cut -d' ' -f1` >>> >>> As you have already read $destfile to $origfile.tmp, there is no need >>> to read it again - you only need to calculate md5sum of the >>> origfile.tmp. >>> >>>> if [ "$origsum" != "$destsum" ]; then >>>> echo "md5sums don't match! orig=$origsum dest=$destsum" >>>> stat $origfile >>>> stat $destfile >>>> exit 1 >>>> fi >>>> >>>> i=`expr $i + 1` >>>> done >>>> >>>> -------------------------[snip]------------------------------ >>>> >>>> I ran the above with the first arg set to a ~615M .iso file on local >>>> disk and the second to a file on a cifs mount. >>>> >>>> I ran it against my win2k8 host for several hours and it never failed. >>>> I then tried running it against my Windows 7 home host (running on >>>> bare-metal) and it would run for a little while and would eventually >>>> fail due to the server returning "out of memory" errors. Some of those >>>> would occur on the NEGOTIATE call, so I chalk that up to a Win7 bug. >>>> >>>> I never saw this mismatch, but I think we can try to infer something >>>> from the nature of the failures that Pavel saw... >>>> >>>> Since the file was apparently being written properly, the write phase >>>> seems like it worked correctly. The data all went into the cache, and >>>> then got flushed properly to the server. >>>> >>>> So, it seems likely that the problem is in the read phase of the test. >>>> There are several possibilities: >>>> >>>> 1) we started out doing a cache read, but the cache was invalidated >>>> partway through. "Something happened" and one of the reads got mangled. >>>> >>>> 2) the server sent us a corrupt read for some reason >>>> >>>> 3) lower level networking problem caused a corrupt read >>>> >>>> 4) generic memory corruption in the pagecache of some sort >>>> >>>> ...plus many others... >>>> >>>> The fact that only 127 bytes was corrupt is very odd. It would be >>>> easier to understand if an entire page were bad, or an entire rsize >>>> chunk. >>>> >>>> If you are able to reproduce this again, it might be helpful to see if >>>> that's consistent. Try to nail down the nature of the corruption -- see >>>> how much is different and where the different parts are. That may >>>> help shed light on the problem... >>>> >>>> In any case, this will probably take some digging -- we should probably >>>> open a bug at bugzilla.samba.org and start working on this there. >>>> Pavel, would you mind doing that when you have time? >>>> >>>> Thanks, >>>> -- >>>> Jeff Layton <jlayton@xxxxxxxxxx> >>>> >>> >>> So, after a closer investigating of the problem I figured out that: >>> >>> 1) It always reproduces after I boot the OS, load module, mount share >>> and read the existing file. >>> >>> 2) Network traffics that are caught by wireshark on the server >>> (Windows 7) and the client are different - I checked it and found the >>> same difference in response packets for the area that is different on >>> orig and orig.tmp files (the response packet from the capture on the >>> server was true and the response packet from the capture on the client >>> was failed). >>> >>> 3) The different area is always 128 bytes bounded but appears in >>> different places. >>> >>> 4) It doesn't depends on a maybe broken LAN cable - I used two >>> different ones with the same results. >>> >>> So, I don't think that it's cifs module issue and there is no need to >>> open a bug on bugzilla.samba.org. It seems that it's the problem with >>> the network driver or with the LAN card from my laptop. >>> >>> Make sense? >> >> Yes ... but it brings up the obvious question ... what happens if cifs >> signing is turned on? >> > > I tried it - one "Unexpected SMB signature" message appears in dmesg. > > I found this code in cifs_check_receive: > 503 /* convert the length into a more usable form */ > 504 if (server->sec_mode & (SECMODE_SIGN_REQUIRED | > SECMODE_SIGN_ENABLED)) { > 505 struct kvec iov; > 506 > 507 iov.iov_base = mid->resp_buf; > 508 iov.iov_len = len; > 509 /* FIXME: add code to kill session */ > 510 if (cifs_verify_signature(&iov, 1, server, > 511 > mid->sequence_number + 1) != 0) > 512 cERROR(1, "Unexpected SMB signature"); > 513 } > > So, now we don't do anything with if such a thing happens. My be we > should follow the FIXME comment and kill the session? > > -- > Best regards, > Pavel Shilovsky. > -- Thanks, Steve -- To unsubscribe from this list: send the line "unsubscribe linux-cifs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html