Hi JH,

On Tue, Feb 4, 2020 at 5:43 PM JH <jupiter.hce@xxxxxxxxx> wrote:
>
> Hi,
>
> It is a bad day we have 5 devices failed NAND booting all of certain
> today. The 5 devices running kernel 4.19.75 on iMX6ULL customized
> board, the devices had been running for weeks, the device DC power is
> supplied from AC via ADC and regulator, we turned power on and off
> several times when installing those test devices to test boxes in the
> last couple of days without problems, then they all failed together
> mysteriously today. It could not complete the booting to Linux user
> space, so I am not able to log into the user space to check and to
> debug it.
>
...snip...

I've been following your questions on both this list and the
linux-wireless one. May I recommend some reading:

http://www.linux-mtd.infradead.org/doc/nand.html

It isn't clear what filesystem you're using, though I recall from an
earlier email you weren't running UBIFS. But in the log I do see UBIFS
messages.

In any case, based on your descriptions, I strongly suspect NAND
bitflips are causing your filesystem corruptions, and you likely don't
have the correct settings for the ECC strength as necessary for your
NAND. Or maybe you're not flashing images correctly and the ECC info is
getting lost. Or maybe you're writing logs and such to flash and you're
filling up the filesystem. Maybe your extents aren't correct and one
filesystem overwrites another.

Unfortunately, you've got your system so cobbled up with user-space
prettiness in your log output that you're obscuring the kernel log
output that would help you diagnose the problems.

Some steps/advice to help your debugging:

* Stop making assumptions about what is wrong or what couldn't possibly
  be wrong. Use evidence only. Test and validate each assumption.
* Fix your serial port logging output so you can actually see all the
  kernel messages instead of the systemd messages that aren't helping
  you.
* You don't need access to the user-space tools on your corrupted
  filesystem.
  You could NFS-mount a root via U-Boot and then use the tools to
  analyze your flash.
* Read the schematics of your device. Understand how your NAND is
  hooked up. Is it correct?
* Read the datasheet of your NAND and your flash controller. Check your
  configurations against requirements.
* Understand what your flash partitions are, where each filesystem is
  in NAND, etc. Check that the extents are correct and you're not
  overwriting.
* Make sure you're not writing stuff to your flash and filling it.
* Check your required ECC strength. Verify that ECC bits are actually
  being written during use and are correct.
* Make sure your method of flashing images writes the ECC bits
  correctly. Verify.
* You can use U-Boot to dump your NAND pages and verify your ECC bits
  are being written how they should be.
* Enable as much kernel log output as possible so you can see the
  relevant debug messages.

Also see the list here:
http://lists.infradead.org/pipermail/linux-mtd/2018-December/086331.html

I don't know what's going on with your system. You have presented a
large number of random symptoms, a lot of assumptions, but very little
real information that we can help you with. And from the information
you present, pretty much no one here is going to be able to solve it
for you - _you_ need to solve your problem. The only way you're going
to do that is to UNDERSTAND the problem first. Get the right debug
output, understand your hardware inside and out, and verify the
software matches the hardware configuration and you'll probably get a
lot closer to finding your problem.

Debugging flash corruption problems is a non-trivial activity. The last
product I had to do it on took me 6 months of investigation before we
finally solved it. It was a combination of several errors, and fixing
each one helped, but of course made the others harder to find as the
cycle time between failures increased.
Some problems were our fault and others were caused by an undocumented
silicon error that took us a while to realize and work around.

Buckle down and go step by step. With luck you'll find the problem
quickly. If not, take it as an opportunity to become an expert in every
level of your system.

I wish you luck.

- Steve

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
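
One item from the checklist above, checking that the partition extents
are correct and that no filesystem overwrites another, can be sketched
in Python. This is only an illustration under stated assumptions: it
assumes the kernel log contains MTD partition lines of the form
0x<start>-0x<end> : "<name>" with an exclusive end address, the helper
names parse_partitions and find_overlaps are hypothetical, and the
example log is invented rather than taken from the failing boards.

```python
import re

# Assumed format of the partition lines the kernel MTD layer prints, e.g.
#   0x000000000000-0x000000400000 : "boot"
PART_RE = re.compile(r'0x([0-9a-fA-F]+)-0x([0-9a-fA-F]+)\s*:\s*"([^"]*)"')


def parse_partitions(log_text):
    """Return (name, start, end) tuples found in log_text; end is exclusive."""
    parts = []
    for m in PART_RE.finditer(log_text):
        parts.append((m.group(3), int(m.group(1), 16), int(m.group(2), 16)))
    return parts


def find_overlaps(parts):
    """Return pairs of partition names whose [start, end) ranges intersect."""
    ordered = sorted(parts, key=lambda p: p[1])
    overlaps = []
    for i, (name_a, start_a, end_a) in enumerate(ordered):
        for name_b, start_b, end_b in ordered[i + 1:]:
            if start_b >= end_a:
                # Sorted by start address, so no later entry can overlap a.
                break
            overlaps.append((name_a, name_b))
    return overlaps


# Invented example layout in which "rootfs" starts inside "kernel".
example_log = '''
0x000000000000-0x000000400000 : "boot"
0x000000400000-0x000000c00000 : "kernel"
0x000000b00000-0x000008000000 : "rootfs"
'''

if __name__ == "__main__":
    for a, b in find_overlaps(parse_partitions(example_log)):
        print(f"partition {a!r} overlaps {b!r}")
```

Feeding it the partition lines captured from the serial console is
enough; an empty result only says the ranges do not collide, not that
they match what the bootloader or the flashing tool assumes.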