Re: async read patchset test results

Jeff Layton <jlayton@xxxxxxxxxx> · Wed, 19 Oct 2011 06:42:29 -0400

On Wed, 19 Oct 2011 00:29:53 +0400
Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:

> 2011/10/14 Jeff Layton <jlayton@xxxxxxxxxx>:
> > On Fri, 14 Oct 2011 14:02:54 +0400
> > Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
> >
> >> Today, I caught it once again and didn't notice any reconnects (no
> >> cERRORs).
> >>
> >> It surely does not depend on Jeff's async read patchset, because I
> >> used my cifs-3.2-current branch.
> >>
> >> My branch consists of Steve's master + lockpatchset + smb2 patches.
> >> On the other hand, I previously caught it with Jeff's branch (without
> >> lockpatchset and smb2 patches). So the problem must be in the
> >> existing cifs code.
> >>
> >> FYI: I compared the two files, "buggy" and original, and found that
> >> they differ in only one place - positions 2014442 to 2014569. That
> >> range has 126 differing bytes plus two bytes that happen to match.
> >>
> >> So, 2014569 - 2014442 + 1 = 128 wrong bytes. Ideas?
> >>
> >
> > Good to know, thanks. I also tried reproducing this for a while last
> > night and was unable to...
> >
> > I used this script:
> >
> > -------------------------[snip]------------------------------
> > #!/bin/bash
> >
> > origfile=$1
> > destfile=$2
> >
> > origsum=`md5sum $origfile | cut -d' ' -f1`
> > i=0
> >
> > while true; do
> >        echo $i
> >        rm -f $destfile $origfile.tmp
> >
> >        dd if=$origfile of=$destfile bs=100000
> >        if [ $? -ne 0 ]; then
> >                echo "dd1 failed"
> >                exit 1
> >        fi
> >
> >        dd if=$destfile of=$origfile.tmp bs=100000
> >        if [ $? -ne 0 ]; then
> >                echo "dd2 failed"
> >                exit 1
> >        fi
> >
> >        destsum=`md5sum $destfile | cut -d' ' -f1`
> 
> As you have already read $destfile into $origfile.tmp, there is no need
> to read it again - you only need to calculate the md5sum of
> $origfile.tmp.
> 
> >        if [ "$origsum" != "$destsum" ]; then
> >                echo "md5sums don't match! orig=$origsum dest=$destsum"
> >                stat $origfile
> >                stat $destfile
> >                exit 1
> >        fi
> >
> >        i=`expr $i + 1`
> > done
> >
> > -------------------------[snip]------------------------------
> >
> > I ran the above with the first arg set to a ~615M .iso file on local
> > disk and the second to a file on a cifs mount.
> >
> > I ran it against my win2k8 host for several hours and it never failed.
> > I then tried running it against my Windows 7 home host (running on
> > bare-metal) and it would run for a little while and would eventually
> > fail due to the server returning "out of memory" errors. Some of those
> > would occur on the NEGOTIATE call, so I chalk that up to a Win7 bug.
> >
> > I never saw this mismatch, but I think we can try to infer something
> > from the nature of the failures that Pavel saw...
> >
> > Since the file was apparently being written properly, the write phase
> > seems like it worked correctly. The data all went into the cache, and
> > then got flushed properly to the server.
> >
> > So, it seems likely that the problem is in the read phase of the test.
> > There are several possibilities:
> >
> > 1) we started out doing a cache read, but the cache was invalidated
> > partway through. "Something happened" and one of the reads got mangled.
> >
> > 2) the server sent us a corrupt read for some reason
> >
> > 3) lower level networking problem caused a corrupt read
> >
> > 4) generic memory corruption in the pagecache of some sort
> >
> > ...plus many others...
> >
> > The fact that only 128 bytes were corrupt is very odd. It would be
> > easier to understand if an entire page were bad, or an entire rsize
> > chunk.
> >
> > If you are able to reproduce this again, it might be helpful to see if
> > that's consistent. Try to nail down the nature of the corruption -- see
> > how much is different and where the different parts are. That may
> > help shed light on the problem...
> >
> > In any case, this will probably take some digging -- we should probably
> > open a bug at bugzilla.samba.org and start working on this there.
> > Pavel, would you mind doing that when you have time?
> >
> > Thanks,
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> >
> 
> So, after investigating the problem more closely, I figured out that:
> 
> 1) It always reproduces after I boot the OS, load the module, mount the
> share and read the existing file.
> 
> 2) The network traffic captured by wireshark on the server (Windows 7)
> and on the client differs. I compared the captures and found the same
> difference in the response packets for the area that differs between
> the orig and orig.tmp files (the response packet in the server capture
> was correct and the one in the client capture was corrupt).
> 
> 3) The differing area is always bounded to 128 bytes but appears in
> different places.
> 
> 4) It doesn't depend on a possibly broken LAN cable - I used two
> different ones with the same results.
> 
> So, I don't think that it's a cifs module issue and there is no need to
> open a bug on bugzilla.samba.org. It seems that it's a problem with
> the network driver or with the LAN card in my laptop.
> 
> Make sense?
> 

Makes sense. Thanks for digging deeply into it. That helps explain why I
haven't been able to reproduce this.
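
For anyone else who wants to try reproducing this, here is a rough sketch
of a single iteration of the test with Pavel's suggestion folded in: the
checksum is taken over the read-back copy ($origfile.tmp) rather than by
reading $destfile a second time. The function name roundtrip_check and the
rt_ variable names are made up for illustration; this has not been run
against a cifs mount.

```shell
#!/bin/bash
# One write/read-back pass: copy orig to dest, copy dest back to
# orig.tmp, then compare md5sums of orig and the read-back copy.
# Returns 0 when the sums match, 1 on any failure.
roundtrip_check() {
    rt_orig=$1
    rt_dest=$2
    rt_back=$rt_orig.tmp

    rm -f "$rt_dest" "$rt_back"

    dd if="$rt_orig" of="$rt_dest" bs=100000 2>/dev/null ||
        { echo "dd1 failed"; return 1; }
    dd if="$rt_dest" of="$rt_back" bs=100000 2>/dev/null ||
        { echo "dd2 failed"; return 1; }

    rt_origsum=$(md5sum "$rt_orig" | cut -d' ' -f1)
    rt_backsum=$(md5sum "$rt_back" | cut -d' ' -f1)

    if [ "$rt_origsum" != "$rt_backsum" ]; then
        echo "md5sums don't match! orig=$rt_origsum readback=$rt_backsum"
        return 1
    fi
    return 0
}
```

Calling it in a loop (roundtrip_check local.iso /mnt/cifs/copy.iso, with
both paths being placeholders) gives the same test as the original script.
On a mismatch the .tmp file is left in place so it can be compared against
the original.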

You may want to open a bug at kernel.org with these findings or send an
email to LKML or the netdev mailing list. I'd imagine that those folks
would be very interested in this.
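
When writing that report, it may help to include the exact extent of each
corrupt region. A minimal sketch, assuming a cmp that supports -l (GNU
diffutils does) and awk: cmp -l prints one line per differing byte with
its 1-based offset, and the awk part merges consecutive offsets into runs.
The function name diff_runs is just an illustration.

```shell
#!/bin/bash
# Print "start end length" for each contiguous run of differing bytes
# between two files. Offsets are 1-based, as reported by cmp -l.
diff_runs() {
    cmp -l "$1" "$2" | awk '
        NR == 1        { start = $1; prev = $1; next }
        $1 == prev + 1 { prev = $1; next }
                       { print start, prev, prev - start + 1
                         start = $1; prev = $1 }
        END            { if (NR > 0) print start, prev, prev - start + 1 }
    '
}
```

Note that bytes which happen to match inside a damaged region split it
into separate runs, so the overall extent is the first start to the last
end. Identical files produce no output.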

What kind of networking hardware is in this laptop, btw?
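
One way to find out which driver is bound to each interface is to read
the driver symlinks in sysfs. This is only a sketch, assuming the usual
/sys/class/net/<iface>/device/driver layout; the function name
list_net_drivers is made up, and the optional argument exists only so it
can be pointed at a test directory.

```shell
#!/bin/bash
# Print "interface driver" for each network interface that is backed by
# a device. Takes an optional sysfs root, defaulting to /sys/class/net.
list_net_drivers() {
    lnd_root=${1:-/sys/class/net}
    for lnd_dev in "$lnd_root"/*; do
        # Virtual interfaces such as lo have no device/driver link.
        [ -L "$lnd_dev/device/driver" ] || continue
        printf '%s %s\n' "$(basename "$lnd_dev")" \
            "$(basename "$(readlink "$lnd_dev/device/driver")")"
    done
}
```

Running list_net_drivers with no argument prints one line per physical
interface. The driver name plus the lspci line for the card would be
useful to include in a report.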

-- 
Jeff Layton <jlayton@xxxxxxxxxx>