Re: Corrupted NAND booting for all devices

Hi JH,

On Tue, Feb 4, 2020 at 5:43 PM JH <jupiter.hce@xxxxxxxxx> wrote:
>
> Hi,
>
> It is a bad day we have 5 devices failed NAND booting all of certain
> today. The 5 devices running kernel 4.19.75 on iMX6ULL customized
> board, the devices had been running for weeks, the device DC power is
> supplied from AC via ADC and regulator, we turned power on and off
> several times when installing those test devices to test boxes in the
> last couple of days without problems, then they all failed together
> mysteriously today. It could not complete the booting to Linux user
> space, so I am not able to log into the user space to check and to
> debug it.
>
...snip...

I've been following your questions on both this list and the
linux-wireless one. May I recommend some reading:
http://www.linux-mtd.infradead.org/doc/nand.html

It isn't clear what filesystem you're using, though I recall from an
earlier email you weren't running UBIFS. But in the log I do see UBIFS
messages. In any case, based on your descriptions, I strongly suspect
NAND bitflips are causing your filesystem corruptions, and you likely
don't have the correct settings for the ECC strength as necessary for
your NAND. Or maybe you're not flashing images correctly and the ECC
info is getting lost. Or maybe you're writing logs and such to flash
and you're filling up the filesystem. Maybe your extents aren't correct
and one filesystem overwrites another. Unfortunately, you've got your
system so cobbled up with user-space prettiness in your log output
that you're obscuring the kernel log output that would help you
diagnose the problems.
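
To make the extents point concrete, here is a minimal sketch of an
overlap check. It assumes you have collected each partition's name,
start offset and size from wherever they are defined on your system
(mtdparts, device tree, flashing scripts). The Part type, the
find_overlaps helper and the layout below are hypothetical and only for
illustration; they are not taken from your report.

```python
from typing import List, NamedTuple, Tuple


class Part(NamedTuple):
    name: str
    offset: int  # start of the partition in bytes
    size: int    # length of the partition in bytes


def find_overlaps(parts: List[Part]) -> List[Tuple[str, str]]:
    """Return pairs of partition names whose byte ranges intersect."""
    ordered = sorted(parts, key=lambda p: p.offset)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.offset >= a.offset + a.size:
                break  # sorted by offset, so nothing later can overlap a
            overlaps.append((a.name, b.name))
    return overlaps


# Hypothetical layout, for illustration only.
MiB = 1024 * 1024
layout = [
    Part("u-boot", 0, 4 * MiB),
    Part("kernel", 4 * MiB, 8 * MiB),
    Part("rootfs", 10 * MiB, 100 * MiB),  # deliberately starts inside "kernel"
]
print(find_overlaps(layout))  # prints [('kernel', 'rootfs')]
```

An empty list only means the table is self-consistent. You still need
to verify that what is actually written to flash matches that table.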

Some steps/advice to help your debugging:
* Stop making assumptions about what is wrong or what couldn't possibly
be wrong. Use evidence only. Test and validate each assumption.
* Fix your serial port logging output so you can actually see all the
kernel messages instead of the systemd messages that aren't helping
you.
* You don't need access to the user-space tools on your corrupted
filesystem. You could boot from U-Boot with an NFS-mounted root
filesystem and then use the tools there to analyze your flash.
* Read the schematics of your device. Understand how your NAND is
hooked up.  Is it correct?
* Read the datasheet of your NAND and your flash controller. Check
your configurations against requirements.
* Understand what your flash partitions are, where each filesystem is
in NAND, etc. Check that the extents are correct and you're not
overwriting.
* Make sure you're not writing stuff to your flash and filling it.
* Check your required ECC strength. Verify that ECC bits are actually
being written during use and are correct.
* Make sure your method of flashing images writes the ECC bits correctly. Verify.
* You can use U-Boot to dump your NAND pages and verify your ECC bits
are being written how they should be.
* Enable as much kernel log output as possible so you can see the
relevant debug messages.
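
As one way to gather evidence for the ECC items above, here is a
minimal sketch that reads the per-device ECC attributes the kernel
exposes under /sys/class/mtd (ecc_strength, ecc_step_size,
bitflip_threshold, corrected_bits, ecc_failures, bad_blocks). It
assumes Python is available on the root filesystem you boot from, for
example an NFS root. The read_mtd_stats and report helpers and the
REQUIRED value are hypothetical; take the real required strength from
your NAND datasheet.

```python
from pathlib import Path

# Attribute names as listed in the kernel's sysfs-class-mtd ABI documentation.
ATTRS = ("name", "ecc_strength", "ecc_step_size", "bitflip_threshold",
         "corrected_bits", "ecc_failures", "bad_blocks")


def read_mtd_stats(sysfs_root="/sys/class/mtd"):
    """Collect ECC-related attributes for every mtdN device under sysfs_root."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    stats = {}
    for dev in sorted(root.glob("mtd[0-9]*")):
        if dev.name.endswith("ro"):
            continue  # skip the read-only mtdNro aliases
        entry = {}
        for attr in ATTRS:
            f = dev / attr
            if f.is_file():
                entry[attr] = f.read_text().strip()
        stats[dev.name] = entry
    return stats


def report(stats, required_strength):
    """Return warnings for weak ECC settings or uncorrectable errors."""
    warnings = []
    for dev, e in stats.items():
        strength = int(e.get("ecc_strength", 0))
        if strength < required_strength:
            warnings.append(
                f"{dev}: ecc_strength {strength} < required {required_strength}")
        if int(e.get("ecc_failures", 0)) > 0:
            warnings.append(
                f"{dev}: {e['ecc_failures']} uncorrectable ECC failures")
    return warnings


if __name__ == "__main__":
    REQUIRED = 4  # placeholder only: use the value from your NAND datasheet
    for line in report(read_mtd_stats(), REQUIRED):
        print(line)
```

These counters only reflect what the kernel has seen since boot, so
treat them as one data point alongside the raw page dumps and the
kernel log, not as proof either way.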

Also see the list here:
http://lists.infradead.org/pipermail/linux-mtd/2018-December/086331.html

I don't know what's going on with your system. You have presented a
large number of random symptoms, a lot of assumptions, but very little
real information that we can help you with. And from the information
you present, pretty much no one here is going to be able to solve it
for you - _you_ need to solve your problem. The only way you're going
to do that is to UNDERSTAND the problem first. Get the right debug
output, understand your hardware inside and out, and verify the
software matches the hardware configuration and you'll probably get a
lot closer to finding your problem.

Debugging flash corruption problems is a non-trivial activity. The last
product I had to do it on took me 6 months of investigation before we
finally solved it. It was a combination of several errors, and fixing
each one helped, but of course made the others harder to find as the
cycle time between failures increased. Some problems were our fault
and others were caused by an undocumented silicon error that took us a
while to realize and work around. Buckle down and go step by step.
With luck you'll find the problem quickly. If not, take it as an
opportunity to become an expert in every level of your system. I wish
you luck.

- Steve

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/


