On Mon, Sep 14, 2015 at 03:16:50PM -0700, Oren Laskin wrote:
> I would hit this error on my Armada 370 board about 20% of the time
> after downloading a 30MB file to /tmp. We're running a 1 Gb SGMII
> link. I would hit this in less than a minute before removing this
> commit from my tree. I've now been running this test in a loop for a
> few hours with no problems.

Ouch.

At the time, I tested this patch with several runs of 20 wget/md5sum of
a 1GB file (with jumbo frames enabled or not). I also used this program:
http://git.lacie-nas.org/?p=netsum.git;a=summary. It detects data
corruption over the network very quickly.

It is very weird, I should have seen something... Moreover, I understand
that you reproduce the issue very quickly and without any refilling
errors. That is also quite weird, because the patch does basically
nothing in such a case.

BTW, which hardware are you using exactly?

I'll definitely have a closer look at it tomorrow.

Simon

>
> It was somewhat hard to diagnose since files I transferred with scp
> didn't show the issue (or at least not as quickly). I set up an http
> program to serve a file and replicated the problem with wget and
> found it.
>
> Oren
>
> On Mon, Sep 14, 2015 at 3:13 PM, Simon Guinot <simon.guinot@xxxxxxxxxxxx> wrote:
> > Hi Oren,
> >
> > On Mon, Sep 14, 2015 at 01:22:12PM -0700, Oren Laskin wrote:
> >> I had to undo this change on my Armada 370 based board. It was causing
> >> corrupt data to make it through on large downloads. I'm using wget to get
> >> the same 30MB file many times and the SHA would occasionally be different.
> >
> > During your tests, can you see some "Linux processing - Can't refill"
> > messages along with the data corruptions?
> >
> >> I tracked it down to this commit. In it, I would find on the order of a
> >> few hundred bytes to simply be wrong data.
> >
> > I am a little bit surprised here. For me, this patch is very simple and
> > does the exact opposite. It does fix kernel crashes and data corruptions
> > in case of refilling errors. This can happen for example if you run
> > large data transfers with jumbo frames enabled...
> >
> > But anyway, I'll try to reproduce the issue tomorrow. I only have to
> > wget the same file (size 30MB) in a loop and check its md5sum?
> > That's it? And how long should I wait for the error?
> >
> > Thanks,
> >
> > Simon
> >
> >>
> >> Thanks,
> >>
> >> Oren
> >>
> >> On Tue, Jul 21, 2015 at 12:30 AM, David Miller <davem@xxxxxxxxxxxxx> wrote:
> >>
> >> > From: Simon Guinot <simon.guinot@xxxxxxxxxxxx>
> >> > Date: Sun, 19 Jul 2015 13:00:53 +0200
> >> >
> >> > > With the current code, if a memory allocation error happens while
> >> > > refilling a Rx descriptor, then the original Rx buffer is both passed
> >> > > to the networking stack (in a SKB) and left in the Rx ring. This leads
> >> > > to various kernel oops and crashes.
> >> > >
> >> > > As a fix, this patch moves Rx descriptor refilling ahead of building
> >> > > SKB with the associated Rx buffer. In case of a memory allocation
> >> > > failure, data is dropped and the original DMA buffer is put back into
> >> > > the Rx ring.
> >> > >
> >> > > Signed-off-by: Simon Guinot <simon.guinot@xxxxxxxxxxxx>
> >> > > Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP network unit")
> >> > > Cc: <stable@xxxxxxxxxxxxxxx> # v3.8+
> >> > > Tested-by: Yoann Sculo <yoann@xxxxxxxx>
> >> >
> >> > Applied, thanks.
> >> >
> >> > _______________________________________________
> >> > linux-arm-kernel mailing list
> >> > linux-arm-kernel@xxxxxxxxxxxxxxxxxxx
> >> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> >> >
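
For readers following the thread, the ordering described in the quoted commit message (refill the Rx descriptor first, and only then hand the received buffer up the stack) can be illustrated with a small user-space sketch. This is not the mvneta driver code: `struct rx_ring`, `rx_buf_alloc()`, `rx_receive()` and the `fail_next_alloc` hook are hypothetical names invented for the illustration, and plain `malloc()` stands in for the kernel's buffer allocation and DMA mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4
#define BUF_SIZE  64

/* Hypothetical stand-in for an Rx ring: one buffer per descriptor. */
struct rx_ring {
    unsigned char *buf[RING_SIZE];
};

/* Hypothetical hook that lets a caller force one allocation failure. */
static bool fail_next_alloc;

static unsigned char *rx_buf_alloc(void)
{
    if (fail_next_alloc) {
        fail_next_alloc = false;
        return NULL;
    }
    return malloc(BUF_SIZE);
}

/* Release every buffer still owned by the ring. */
static void rx_ring_free(struct rx_ring *ring)
{
    for (size_t i = 0; i < RING_SIZE; i++) {
        free(ring->buf[i]);
        ring->buf[i] = NULL;
    }
}

/* Fill every descriptor; on failure, release what was allocated. */
static bool rx_ring_init(struct rx_ring *ring)
{
    for (size_t i = 0; i < RING_SIZE; i++)
        ring->buf[i] = NULL;

    for (size_t i = 0; i < RING_SIZE; i++) {
        ring->buf[i] = rx_buf_alloc();
        if (!ring->buf[i]) {
            rx_ring_free(ring);
            return false;
        }
    }
    return true;
}

/*
 * Refill-first receive: allocate the replacement buffer before giving
 * the received one to the caller. Returns the received buffer (now
 * owned by the caller), or NULL if the packet is dropped because no
 * replacement could be allocated. In the drop case the descriptor
 * keeps its original buffer, so no buffer is ever owned by both the
 * ring and the caller.
 */
static unsigned char *rx_receive(struct rx_ring *ring, size_t idx)
{
    assert(idx < RING_SIZE);

    unsigned char *fresh = rx_buf_alloc();
    if (!fresh)
        return NULL;            /* drop: original buffer stays in the ring */

    unsigned char *received = ring->buf[idx];
    ring->buf[idx] = fresh;     /* descriptor now points at the new buffer */
    return received;
}
```

In this sketch, a failed allocation leaves `ring->buf[idx]` unchanged and the caller gets nothing, which mirrors the "data is dropped and the original DMA buffer is put back into the Rx ring" behaviour in the commit message. With the opposite ordering, the caller would already hold the buffer when the refill fails, which is the double-ownership situation the patch set out to remove.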