Re: NFS corruption, fixed by echo 1 > /proc/sys/vm/drop_caches -- next debugging steps?

Joshua Kinard <kumba@xxxxxxxxxx> · Wed, 15 Mar 2017 12:46:03 -0400

On 03/15/2017 11:31, Matt Turner wrote:
> On Wed, Mar 15, 2017 at 7:00 AM, Manuel Lauss <manuel.lauss@xxxxxxxxx> wrote:
>>
>> On Wed, Mar 15, 2017 at 10:25 AM, Ralf Baechle <ralf@xxxxxxxxxxxxxx> wrote:
>>>
>>> On Mon, Mar 13, 2017 at 09:47:57AM +0000, James Hogan wrote:
>>>
>>>>>
>>>>> Note that the corruption is different across reboots, both in the size
>>>>> of the corruption and the location. I saw ~1900- and ~1400-byte
>>>>> sequences corrupted on separate occasions, which don't correspond to
>>>>> the system's 16kB page size.
>>>>>
>>>>> I've tested kernels from v3.19 to 4.11-rc1+ (master branch from
>>>>> today). All exhibit this behavior with differing frequencies. Earlier
>>>>> kernels seem to reproduce the issue less often, while more recent
>>>>> kernels reliably exhibit the problem every boot.
>>>>>
>>>>> How can I further debug this?
>>>>
>>>> It smells a bit like a DMA / caching issue.
>>>>
>>>> Can you provide a full kernel log? That might provide some information
>>>> about caching that might be relevant (e.g. does dcache have aliases?).
>>>
>>> The architecture of the BCM1250 SOC used for the BCM91250 boards is
>>> fully coherent; the S-cache and D-cache are physically indexed and tagged.
>>> Only the VIVT (plus the usual ASID tagging) I-cache leaves space for
>>> software to screw up cache management, but that shouldn't matter for this
>>> case, so I suggest starting to look into this from the NFS side.
>>
>>
>> I did Matt's tests on Alchemy (VIPT caches) with kernels 3.18 to 4.11-rc
>> against an x86 4.9.15 host, and did not see any problems. Given Ralf's
>> comment about the BCM1250 caches, maybe you have bad hardware (BCM board
>> or network)?
> 
> I certainly cannot rule that possibility out. If that is the case, I
> would like to be sure of it -- see a failure in memtester or something
> for instance. Any suggestions? (I have run memtester and never found
> anything.)
> 
> For what it's worth, did you determine the cause of the NFS corruption
> you reported [1]?
> 
> [1] https://www.spinics.net/lists/mips/msg44006.html

I'm using NFSv4 between my SGI Octane and Intel box with no noticeable issues.
I used both rsync and cp to move a large (~845MB) file between the two, ran
md5sum on both copies, and got the same checksum back.  What NFS versions and
protocols have you tried?  v4 is TCP-only, but v3 can do both UDP and TCP.
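
If it helps to script that check, here is a rough sketch in Python, not
something I have run against the BCM board.  It hashes both copies and, on a
mismatch, lists where the differing byte runs start and how long they are,
which should make it easier to compare corruption sizes across boots.  The
function names and the chunk size are made up for illustration, and only the
common prefix is compared if the two files differ in length.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_runs(path_a, path_b, chunk_size=1 << 20):
    """Return [(offset, length), ...] for each run of differing bytes.

    Only the common prefix is compared; a size difference between the
    two files is not reported as a run.
    """
    runs = []
    start = None   # offset where the current differing run began, if any
    offset = 0     # number of bytes compared so far
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            n = min(len(a), len(b))
            if n == 0:
                break
            if a[:n] == b[:n]:
                # Whole chunk matches: close any run left open earlier.
                if start is not None:
                    runs.append((start, offset - start))
                    start = None
            else:
                for i in range(n):
                    if a[i] != b[i]:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        runs.append((start, offset + i - start))
                        start = None
            offset += n
    if start is not None:
        runs.append((start, offset - start))
    return runs
```

Run md5_of() on the local copy and on the copy read back over NFS; if the
digests differ, differing_runs() on the same two paths gives the offsets and
lengths to report.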

That said, I doubt this'll affect you, but if you're running the XFS
filesystem, version 5 (crc=1, finobt=1), do you notice any oddities when
untarring a really large tarball, like a Gentoo stage or such, on that BCM
machine?  That's revealed a couple of curious issues that may be
Octane-specific and that I haven't tried to track down yet.  It would be
interesting if you saw them as well.  Specifically, if you get a non-fatal
Oops in dmesg from the above, or a message from xfsaild about a possible
deadlock in kmem_alloc(), I'd love to know.
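
A quick way to sift a saved dmesg dump for those two things might look like
the sketch below.  The patterns are my guess at what the lines contain (the
word "Oops", or "deadlock" alongside xfsaild or kmem_alloc), not the exact
kernel message text, so loosen them if they miss anything.

```python
import re

# Assumed patterns, not the exact kernel strings: adjust as needed.
PATTERNS = [
    re.compile(r"\bOops\b"),
    re.compile(r"xfsaild.*deadlock", re.IGNORECASE),
    re.compile(r"deadlock.*kmem_alloc", re.IGNORECASE),
]


def interesting_lines(log_text):
    """Return the log lines that match any of the assumed patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in PATTERNS)
    ]
```

Feed it the text of a dmesg capture taken right after the untar finishes and
post whatever lines it returns.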

-- 
Joshua Kinard
Gentoo/MIPS
kumba@xxxxxxxxxx
6144R/F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic