Re: async read patchset test results

Steve French <smfrench@xxxxxxxxx> · Tue, 18 Oct 2011 15:31:40 -0500

On Tue, Oct 18, 2011 at 3:29 PM, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
> 2011/10/14 Jeff Layton <jlayton@xxxxxxxxxx>:
>> On Fri, 14 Oct 2011 14:02:54 +0400
>> Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
>>
>>> Today, I caught it once again and didn't notice any reconnects (no cERRORs).
>>>
>>> It surely does not depend on Jeff's async read patchset, because I used
>>> my cifs-3.2-current branch.
>>>
>>> My branch consists of Steve's master + lockpatchset + smb2 patches.
>>> On the other hand, I previously caught it with Jeff's branch (without
>>> lockpatchset and smb2 patches). So the problem must be in the
>>> existing cifs code.
>>>
>>> FYI: I compared the two files, "buggy" and original, and noticed that
>>> the differences are confined to one place, positions 2014442 to
>>> 2014569: 126 differing bytes, plus two bytes in that range that match.
>>>
>>> So, 2014569 - 2014442 + 1 = 128 wrong bytes. Ideas?
>>>
>>
>> Good to know, thanks. I also tried reproducing this for a while last
>> night and was unable to...
>>
>> I used this script:
>>
>> -------------------------[snip]------------------------------
>> #!/bin/bash
>>
>> origfile=$1
>> destfile=$2
>>
>> origsum=`md5sum $origfile | cut -d' ' -f1`
>> i=0
>>
>> while true; do
>>        echo $i
>>        rm -f $destfile $origfile.tmp
>>
>>        dd if=$origfile of=$destfile bs=100000
>>        if [ $? -ne 0 ]; then
>>                echo "dd1 failed"
>>                exit 1
>>        fi
>>
>>        dd if=$destfile of=$origfile.tmp bs=100000
>>        if [ $? -ne 0 ]; then
>>                echo "dd2 failed"
>>                exit 1
>>        fi
>>
>>        destsum=`md5sum $destfile | cut -d' ' -f1`
>
> As you have already read $destfile into $origfile.tmp, there is no need
> to read it again - you only need to calculate the md5sum of
> $origfile.tmp.
>
>>        if [ "$origsum" != "$destsum" ]; then
>>                echo "md5sums don't match! orig=$origsum dest=$destsum"
>>                stat $origfile
>>                stat $destfile
>>                exit 1
>>        fi
>>
>>        i=`expr $i + 1`
>> done
>>
>> -------------------------[snip]------------------------------
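
A minimal sketch of the variant Pavel suggests inline above, checksumming the copy that was already read back instead of reading $destfile a second time. The function name roundtrip_check and the "readback" label are made up for illustration; the dd block size and the .tmp naming follow the script above.

```shell
#!/bin/bash
# Hypothetical helper: copy src to dst, read dst back into src.tmp,
# then compare the md5sum of the original with that of the read-back copy.
roundtrip_check() {
    local src=$1 dst=$2 back=$1.tmp
    local origsum backsum

    rm -f "$dst" "$back"

    # Write phase: local file -> destination (e.g. a file on the cifs mount).
    dd if="$src" of="$dst" bs=100000 2>/dev/null || { echo "dd1 failed"; return 1; }

    # Read phase: destination -> local temporary copy.
    dd if="$dst" of="$back" bs=100000 2>/dev/null || { echo "dd2 failed"; return 1; }

    origsum=$(md5sum "$src" | cut -d' ' -f1)
    backsum=$(md5sum "$back" | cut -d' ' -f1)

    if [ "$origsum" != "$backsum" ]; then
        echo "md5sums don't match! orig=$origsum readback=$backsum"
        return 1
    fi
    echo "ok"
}
```

Calling it in a loop, as in `while roundtrip_check "$origfile" "$destfile"; do :; done`, stops at the first mismatch and leaves both files in place for inspection.
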
>>
>> I ran the above with the first arg set to a ~615M .iso file on local
>> disk and the second to a file on a cifs mount.
>>
>> I ran it against my win2k8 host for several hours and it never failed.
>> I then tried running it against my Windows 7 home host (running on
>> bare-metal) and it would run for a little while and would eventually
>> fail due to the server returning "out of memory" errors. Some of those
>> would occur on the NEGOTIATE call, so I chalk that up to a Win7 bug.
>>
>> I never saw this mismatch, but I think we can try to infer something
>> from the nature of the failures that Pavel saw...
>>
>> Since the file was apparently being written properly, the write phase
>> seems like it worked correctly. The data all went into the cache, and
>> then got flushed properly to the server.
>>
>> So, it seems likely that the problem is in the read phase of the test.
>> There are several possibilities:
>>
>> 1) we started out doing a cache read, but the cache was invalidated
>> partway through. "Something happened" and one of the reads got mangled.
>>
>> 2) the server sent us a corrupt read for some reason
>>
>> 3) lower level networking problem caused a corrupt read
>>
>> 4) generic memory corruption in the pagecache of some sort
>>
>> ...plus many others...
>>
>> The fact that only 128 bytes were corrupt is very odd. It would be
>> easier to understand if an entire page were bad, or an entire rsize
>> chunk.
>>
>> If you are able to reproduce this again, it might be helpful to see if
>> that's consistent. Try to nail down the nature of the corruption -- see
>> how much is different and where the different parts are. That may
>> help shed light on the problem...
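
One way to nail down the extent and position of the corruption, as a rough sketch: it uses `cmp -l`, which prints one line per differing byte with a 1-based offset. The function name diff_summary is made up, and the 4096-byte page size is an assumption; adjust it for the machine in question.

```shell
#!/bin/bash
# Hypothetical helper: summarize where two files differ.
# cmp -l prints, for each differing byte, its 1-based offset followed by
# the two byte values in octal; awk reduces that to a short report.
diff_summary() {
    cmp -l "$1" "$2" 2>/dev/null | awk '
        NR == 1 { first = $1 }
                { last = $1; n++ }
        END {
            if (n == 0) { print "no differences"; exit }
            printf "differing bytes: %d\n", n
            printf "first offset: %d, last offset: %d, span: %d bytes\n",
                   first, last, last - first + 1
            # Assumes 4096-byte pages; shows where the damage starts in its page.
            printf "page of first offset: %d, offset within that page: %d\n",
                   int((first - 1) / 4096), (first - 1) % 4096
        }'
}
```

Running `diff_summary original buggy` after each failure and comparing the reports would show whether the span and the alignment stay the same from run to run.
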
>>
>> In any case, this will probably take some digging -- we should probably
>> open a bug at bugzilla.samba.org and start working on this there.
>> Pavel, would you mind doing that when you have time?
>>
>> Thanks,
>> --
>> Jeff Layton <jlayton@xxxxxxxxxx>
>>
>
> So, after a closer investigation of the problem I figured out that:
>
> 1) It always reproduces after I boot the OS, load the module, mount the
> share and read the existing file.
>
> 2) The network traffic captured by Wireshark on the server (Windows 7)
> and on the client differs. I checked and found the same difference in
> the response packets for the area that differs between the orig and
> orig.tmp files: the response packet in the server-side capture was
> correct, and the one in the client-side capture was corrupted.
>
> 3) The differing area is always limited to 128 bytes but appears in
> different places.
>
> 4) It doesn't depend on a possibly broken LAN cable - I used two
> different ones with the same results.
>
> So, I don't think it's a cifs module issue, and there is no need to
> open a bug on bugzilla.samba.org. It seems to be a problem with
> the network driver or with the LAN card in my laptop.
>
> Make sense?

Yes ... but it brings up the obvious question ... what happens if cifs
signing is turned on?

-- 
Thanks,

Steve