Re: NFS corruption, fixed by echo 1 > /proc/sys/vm/drop_caches -- next debugging steps?

On Wed, Mar 15, 2017 at 08:31:19AM -0700, Matt Turner wrote:

> On Wed, Mar 15, 2017 at 7:00 AM, Manuel Lauss <manuel.lauss@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 15, 2017 at 10:25 AM, Ralf Baechle <ralf@xxxxxxxxxxxxxx> wrote:
> >>
> >> On Mon, Mar 13, 2017 at 09:47:57AM +0000, James Hogan wrote:
> >>
> >> > >
> >> > > Note that the corruption is different across reboots, both in the size
> >> > > of the corruption and the location. I saw ~1900 and ~1400 byte
> >> > > sequences corrupted on separate occasions, which don't correspond to
> >> > > the system's 16kB page size.
> >> > >
> >> > > I've tested kernels from v3.19 to 4.11-rc1+ (master branch from
> >> > > today). All exhibit this behavior with differing frequencies. Earlier
> >> > > kernels seem to reproduce the issue less often, while more recent
> >> > > kernels reliably exhibit the problem every boot.
> >> > >
> >> > > How can I further debug this?
> >> >
> >> > It smells a bit like a DMA / caching issue.
> >> >
> >> > Can you provide a full kernel log. That might provide some information
> >> > about caching that might be relevant (e.g. does dcache have aliases?).
> >>
> >> The architecture of the BCM1250 SOC used for the BCM91250 boards is
> >> fully coherent; S-cache and D-cache are physically indexed and tagged.
> >> Only the VIVT (plus the usual ASID tagging) I-cache leaves space for
> >> software to screw up cache management but that shouldn't matter for this
> >> case, so I suggest to start looking into this from the NFS side.
> >
> >
> > I did Matt's tests on Alchemy (VIPT caches) with kernels 3.18 to 4.11-rc
> > against an x86 4.9.15 host, and did not see any problems. Given Ralf's
> > comment about the BCM1250 caches, maybe you have bad hardware (BCM board
> > or network)?
> 
> I certainly cannot rule that possibility out. If that is the case, I
> would like to be sure of it -- see a failure in memtester or something
> for instance. Any suggestions? (I have run memtester and never found
> anything)
> 
> For what it's worth, did you determine the cause of the NFS corruption
> you reported [1]?
> 
> [1] https://www.spinics.net/lists/mips/msg44006.html

I've chased my fair share of kernel bugs on Sibyte systems that turned out
to be caused by faulty or unsuitable memory modules, or even by the BGA
solder points of the BCM1250 SOC coming off.  If you have memory modules in
both banks, you may want to check whether you can reproduce the corruption
with only one bank populated, and whether it makes a difference which of
the two banks that is.  Firmware updates have fixed various issues with
memory controller initialization over the years, so if you haven't updated
to the latest and greatest CFE for the board, you may want to try that.
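
To make the one-bank-at-a-time comparison repeatable, something along these
lines could be run against a file on the NFS mount after each hardware
change.  It is only a sketch, not a tested tool: the function names are
made up for illustration, the 16 kB chunk size is simply taken from the
page size mentioned earlier in the thread, and writing to
/proc/sys/vm/drop_caches needs root.

```python
import hashlib
import os


def chunk_digests(path, chunk_size=16 * 1024):
    """Return one SHA-256 hex digest per chunk_size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def differing_chunks(before, after):
    """Indices of chunks whose digests differ (a missing chunk counts)."""
    n = max(len(before), len(after))
    return [i for i in range(n)
            if i >= len(before) or i >= len(after) or before[i] != after[i]]


def drop_page_cache():
    """Flush dirty data, then ask the kernel to drop the clean page cache."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("1\n")


def check_file(path, drop=drop_page_cache):
    """Hash the file, drop the page cache, hash again, report differences."""
    before = chunk_digests(path)   # may be served from cached pages
    drop()
    after = chunk_digests(path)    # has to be fetched from the server again
    return differing_chunks(before, after)
```

Calling check_file() on a large file from the export returns the indices of
the 16 kB chunks that changed once the cache was dropped; an empty list
means both reads agreed.  Note that the file should have been read at least
once before the call, otherwise there is nothing in the page cache to
compare against.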

  Ralf



