Hi Steve,

On 2/5/20, Steve deRosier <derosier@xxxxxxxxx> wrote:
> I've been following your questions on both this list and the
> linux-wireless one. May I recommend some reading:
> http://www.linux-mtd.infradead.org/doc/nand.html
>
> It isn't clear what filesystem you're using, though I recall from an
> earlier email you weren't running UBIFS. But in the log I do see UBIFS
> messages. In any case, based on your descriptions, I strongly suspect
> NAND bitflips are causing your filesystem corruptions, and you likely
> don't have the correct settings for the ECC strength as necessary for
> your NAND. Or maybe you're not flashing images correctly and the ECC
> info is getting lost. Or maybe you're writing logs and such to flash
> and you're filling up the filesystem. Maybe your extents aren't correct
> and one filesystem overwrites another. Unfortunately, you've got your
> system so cobbled up with user-space prettiness in your log output
> that you're obscuring the kernel log output that would help you
> diagnose the problems.

Yes, the file system is UBIFS. Different revisions of the test units
have been running for many months, and they were relatively stable
until this new hardware revision. As you found, we have a lot of
low-level problems when running on the new hardware revision. Since
both the firmware and the hardware have evolved, the first rational
step is to narrow down the source of the problem.

I appreciate all your advice, which is very helpful and valid. The
hardware was designed by other contractors, and there are limited
tools and equipment for a software guy to debug the hardware. The
hardware contractors have firmly ruled out any hardware issues; they
point the finger at the software image built from Yocto as the cause
of the NAND corruption.

The Yocto image consists entirely of open source components (Linux
kernel, connman, MTD, ofono, etc.), so I am trying to figure out
whether there are limitations or constraints on turning the device
power off while it may be in the middle of erasing pages. Would that
corrupt the NAND flash? Or might we not have set things up properly?

I posted here to gather information from your experience and to get
your advice on the circumstances in which NAND corruption can occur,
so that we can mitigate the issues as much as possible.

As you said, many things in software and hardware can cause NAND
corruption. What I am particularly interested in is whether a
so-called bad Yocto image could cause it. To be clear, I am not
talking about software bugs in that image; I am talking about a
problem in the Yocto build system that generates a bad image. I would
have thought that a badly built image would fail to run the first
time. If an image boots from NAND and runs well for several days, what
could the Yocto build system have done to make the image corrupt the
NAND later, like a virus? It does not make sense to me, but I could be
wrong.

Thank you.

Kind regards,

- jh

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/