Re: async read patchset test results

Steve French <smfrench@xxxxxxxxx> · Tue, 18 Oct 2011 20:24:17 -0500

If the session could continue after an incorrect checksum (ie retry
the operation
that failed due to the network error/corruption)
then there would be significant advantage in not taking the session down.
In practice would that request time out and get killed and retried?

On Tue, Oct 18, 2011 at 3:59 PM, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
> 2011/10/19 Steve French <smfrench@xxxxxxxxx>:
>> On Tue, Oct 18, 2011 at 3:29 PM, Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
>>> 2011/10/14 Jeff Layton <jlayton@xxxxxxxxxx>:
>>>> On Fri, 14 Oct 2011 14:02:54 +0400
>>>> Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
>>>>
>>>>> Today, I caught it once again and didn't noticed any reconnects (no cERRORs).
>>>>>
>>>>> It is surely not depends on Jeff's async read patchset, because I used
>>>>> my cifs-3.2-current branch.
>>>>>
>>>>> My branch consists of Steve's master + lockpatchset + smb2 patches.
>>>>> From another hand, previously I caught it with Jeff's branch (without
>>>>> lockpatchset and smb2 patches). So, that's why the problem is in
>>>>> existing cifs code now.
>>>>>
>>>>> FYI: I checked two files: "buggy" and original, and noticed that the
>>>>> difference between them is located in one place - positions from
>>>>> 2014442 to 2014569 - 126 differences with two equal holes.
>>>>>
>>>>> So, 2014569 - 2014442 + 1 = 128 wrong bytes. Ideas?
>>>>>
>>>>
>>>> Good to know, thanks. I also tried reproducing this for a while last
>>>> night and was unable to...
>>>>
>>>> I used this script:
>>>>
>>>> -------------------------[snip]------------------------------
>>>> #!/bin/bash
>>>>
>>>> origfile=$1
>>>> destfile=$2
>>>>
>>>> origsum=`md5sum $origfile | cut -d' ' -f1`
>>>> i=0
>>>>
>>>> while true; do
>>>>        echo $i
>>>>        rm -f $destfile $origfile.tmp
>>>>
>>>>        dd if=$origfile of=$destfile bs=100000
>>>>        if [ $? -ne 0 ]; then
>>>>                echo "dd1 failed"
>>>>                exit 1
>>>>        fi
>>>>
>>>>        dd if=$destfile of=$origfile.tmp bs=100000
>>>>        if [ $? -ne 0 ]; then
>>>>                echo "dd2 failed"
>>>>                exit 1
>>>>        fi
>>>>
>>>>        destsum=`md5sum $destfile | cut -d' ' -f1`
>>>
>>> As you have already read $destfile to $origfile.tmp, there is no need
>>> to read it again - you only need to calculate md5sum of the
>>> origfile.tmp.
>>>
>>>>        if [ "$origsum" != "$destsum" ]; then
>>>>                echo "md5sums don't match! orig=$origsum dest=$destsum"
>>>>                stat $origfile
>>>>                stat $destfile
>>>>                exit 1
>>>>        fi
>>>>
>>>>        i=`expr $i + 1`
>>>> done
>>>>
>>>> -------------------------[snip]------------------------------
>>>>
>>>> I ran the above with the first arg set to a ~615M .iso file on local
>>>> disk and the second to a file on a cifs mount.
>>>>
>>>> I ran it against my win2k8 host for several hours and it never failed.
>>>> I then tried running it against my Windows 7 home host (running on
>>>> bare-metal) and it would run for a little while and would eventually
>>>> fail due to the server returning "out of memory" errors. Some of those
>>>> would occur on the NEGOTIATE call, so I chalk that up to a Win7 bug.
>>>>
>>>> I never saw this mismatch, but I think we can try to infer something
>>>> from the nature of the failures that Pavel saw...
>>>>
>>>> Since the file was apparently being written properly, the write phase
>>>> seems like it worked correctly. The data all went into the cache, and
>>>> then got flushed properly to the server.
>>>>
>>>> So, it seems likely that the problem is in the read phase of the test.
>>>> There are several possibilities:
>>>>
>>>> 1) we started out doing a cache read, but the cache was invalidated
>>>> partway through. "Something happened" and one of the reads got mangled.
>>>>
>>>> 2) the server sent us a corrupt read for some reason
>>>>
>>>> 3) lower level networking problem caused a corrupt read
>>>>
>>>> 4) generic memory corruption in the pagecache of some sort
>>>>
>>>> ...plus many others...
>>>>
>>>> The fact that only 127 bytes was corrupt is very odd. It would be
>>>> easier to understand if an entire page were bad, or an entire rsize
>>>> chunk.
>>>>
>>>> If you are able to reproduce this again, it might be helpful to see if
>>>> that's consistent. Try to nail down the nature of the corruption -- see
>>>> how much is different and where the different parts are. That may
>>>> help shed light on the problem...
>>>>
>>>> In any case, this will probably take some digging -- we should probably
>>>> open a bug at bugzilla.samba.org and start working on this there.
>>>> Pavel, would you mind doing that when you have time?
>>>>
>>>> Thanks,
>>>> --
>>>> Jeff Layton <jlayton@xxxxxxxxxx>
>>>>
>>>
>>> So, after a closer investigating of the problem I figured out that:
>>>
>>> 1) It always reproduces after I boot the OS, load module, mount share
>>> and read the existing file.
>>>
>>> 2) Network traffics that are caught by wireshark on the server
>>> (Windows 7) and the client are different - I checked it and found the
>>> same difference in response packets for the area that is different on
>>> orig and orig.tmp files (the response packet from the capture on the
>>> server was true and the response packet from the capture on the client
>>> was failed).
>>>
>>> 3) The different area is always 128 bytes bounded but appears in
>>> different places.
>>>
>>> 4) It doesn't depends on a maybe broken LAN cable - I used two
>>> different ones with the same results.
>>>
>>> So, I don't think that it's cifs module issue and there is no need to
>>> open a bug on bugzilla.samba.org. It seems that it's the problem with
>>> the network driver or with the LAN card from my laptop.
>>>
>>> Make sense?
>>
>> Yes ... but it brings up the obvious question ... what happens if cifs
>> signing is turned on?
>>
>
> I tried it - one "Unexpected SMB signature" message appears in dmesg.
>
> I found this code in cifs_check_receive:
> 503          /* convert the length into a more usable form */
> 504          if (server->sec_mode & (SECMODE_SIGN_REQUIRED |
> SECMODE_SIGN_ENABLED)) {
> 505                 struct kvec iov;
> 506
> 507                 iov.iov_base = mid->resp_buf;
> 508                 iov.iov_len = len;
> 509                 /* FIXME: add code to kill session */
> 510                 if (cifs_verify_signature(&iov, 1, server,
> 511
> mid->sequence_number + 1) != 0)
> 512                         cERROR(1, "Unexpected SMB signature");
> 513         }
>
> So, now we don't do anything with if such a thing happens. My be we
> should follow the FIXME comment and kill the session?
>
> --
> Best regards,
> Pavel Shilovsky.
>

-- 
Thanks,

Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html