Hi JH,

On Tue, Feb 4, 2020 at 11:58 PM JH <jupiter.hce@xxxxxxxxx> wrote:
>
> Hi Steve,
>
> On 2/5/20, Steve deRosier <derosier@xxxxxxxxx> wrote:
> > I've been following your questions on both this list and the
> > linux-wireless one. May I recommend some reading:
> > http://www.linux-mtd.infradead.org/doc/nand.html
> >
> > It isn't clear what filesystem you're using, though I recall from an
> > earlier email you weren't running UBIFS. But in the log I do see UBIFS
> > messages. In any case, based on your descriptions, I strongly suspect
> > NAND bitflips are causing your filesystem corruptions, and you likely
> > don't have the correct settings for the ECC strength as necessary for
> > your NAND. Or maybe you're not flashing images correctly and the ECC
> > info is getting lost. Or maybe you're writing logs and such to flash
> > and you're filling up the filesystem. Maybe your extents aren't
> > correct and one filesystem overwrites another. Unfortunately, you've
> > got your system so cobbled up with user-space prettiness in your log
> > output that you're obscuring the kernel log output that would help
> > you diagnose the problems.
>
> Yes, the file system is UBIFS. The different revisions of test units
> have been running for many months and were relatively stable until
> now, on a new revision of hardware. Like you found, we have lots of
> low-level problems when running the new revision of hardware. As both
> firmware and hardware evolved, the first rational thing is to narrow
> down the source of the problem.
>

My suggestion, assuming you have a version that ran on the old hardware
and that it can run on the new hardware (so basics like the processor
type haven't changed enough to keep it from working), is to run the
known-good software on the new hardware and see where you stand. My
prediction is it will fail hard, but it should be informative.

With hardware changes, there are two levels:

1. Small enough changes that it's still more or less the same platform,
   but with a few things that need to be changed.
2. Big enough changes that it should be considered a new platform and
   treated as such.

Either way, the approach is similar, just different in scope. In the
former case, take a look at the old schematic and the new schematic and
see what changed. Pull the datasheets of any chips that changed (both
old and new) and check the changed parameters. Check for changed port
lines, GPIO pulls, chip selects, and timing parameters.

In the latter case, we're talking about a changed processor or a changed
device architecture (not ARM to MIPS, that's covered by "processor
changed"; more like how your overall device is designed), etc. Honestly,
just start over. Assume everything changed and start from scratch.
Examine everything: make sure your DT matches, check your compile flags,
test and confirm everything. Basically ground-up board bring-up.
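To make the "make sure your DT matches" point concrete, here is the sort
of thing I mean. The file names below are invented (use your actual
board DTS files), and your SoC's NAND controller binding may use
different property names, so treat this as a sketch: diff what the two
revisions say about the NAND and check it against the new chip's
datasheet.

  # Diff the NAND description between the two hardware revisions
  # (hypothetical file names - substitute your board's DTS files).
  diff -u old-board.dts new-board.dts

  # Properties worth scrutinizing in the NAND controller/chip node
  # (generic binding names; controller-specific bindings vary):
  #   nand-ecc-mode, nand-ecc-strength, nand-ecc-step-size,
  #   nand-on-flash-bbt, the chip select in reg = <...>, and any
  #   timing properties your controller binding defines.
  grep -n -B2 -A20 'nand' old-board.dts
  grep -n -B2 -A20 'nand' new-board.dts

If the new part needs, say, 8 bits of correction per 512 bytes and the
DT (or the driver's default) still describes 4 bits per 512, everything
will appear to work and then slowly rot in exactly the way you're
describing.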
A good question for you - did the old design use SLC NAND and the new
one MLC NAND? That makes a huge difference.

Also, I've found that not all manufacturers are equal in reliability. I
had a hardware team that wanted to substitute a cheaper but "equivalent"
part from a different manufacturer. It had the exact same specs and in
theory should have run seamlessly - yet we had endless corruptions on
the first articles they sent me. We just said "no" and stayed with the
more expensive part, because saving $0.45 on something that would sell
only a few tens of thousands of units wasn't worth the engineering time
to "solve it in software", if that was even possible. Note that this
was only workable because the only thing that changed was the one chip,
which from a software standpoint should have been identical. When you
only change one thing at a time, it helps you find the problems.

> I appreciate all your advice, which is very helpful and valid. The
> hardware was designed by other contractors, and there are limited
> tools and equipment for a software guy to debug the hardware. The
> hardware contractors firmly ruled out any issues in hardware; they
> pointed the finger at the software image built from Yocto as the
> cause of the NAND corruption.

Of course they are saying "it's not my problem". You seem to be living
in a "throw-it-over-the-wall" style organization. In my experience, you
have three choices - get the hell out, become a hardware expert, or
become close friends with one of the guys on the hardware team and
change the culture.

You need to realize that in a way they're right - from their limited
perspective, i.e. the electrons go where they should go, everything is
fine. But gluing a few chips down on a board is only 2% of an embedded
engineer's job; there's a lot more to do because our chips are now so
programmable, and at the end of the day it still needs to work right.
Again, examine and understand the datasheets. Check against the
schematic. Satisfy yourself there are no electrical errors. Validate
that every value that gets set in a controller register is the correct
one.

> The Yocto image contains all open source components - Linux kernel,
> connman, MTD, ofono, etc. - so I am trying to figure out if there are
> limitations or constraints on turning the device power off while it
> may be in the middle of erasing pages. Would that cause the NAND
> flash to become corrupted? Or might we not have set things up
> properly?

UBI is designed to be power-cut safe. That's not to say there haven't
been bugs, or that there isn't one now. But basic things like "the
power cut happened while we were erasing a page" shouldn't be a concern
unless there's a driver bug.

> As you said, there are so many things in software and hardware that
> could cause the NAND corruption. What I am particularly interested in
> is whether a so-called bad Yocto image could cause the NAND
> corruption. To make it clear, I am not talking about software
> problems in that image; I am talking about a Yocto build system
> problem which generated a bad image. I thought that if you built a
> bad image, it would not be able to run the first time. If an image
> boots from NAND and runs well for several days, what could the Yocto
> build system have done to make that image corrupt the NAND later,
> like a virus? It does not make sense to me, but I could be wrong.

While I would say "nothing is impossible", the scenario you're talking
about is close to it. You're looking down a dry hole. If you can
successfully build and run it, it's not a "bad Yocto image" in the
sense you're describing.

Data loss over time is usually a bit-rot issue in my experience. Check
your ECC parameters, and check that ECC data is being written correctly
in ALL cases. Check that U-Boot and the kernel agree on the ECC
parameters. Note - you should be seeing ECC warning messages in your
kernel log on boot, but to my eye you don't have the right settings
there to see what you need to see.
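To give you an idea of what "check your ECC parameters" can look like
in practice - the exact sysfs attribute names depend on your kernel
version, so treat this as a sketch rather than a recipe:

  # On the running target: what does the kernel think about each MTD
  # device, and are bitflips piling up?
  dmesg | grep -i -E 'nand|ecc|ubi'
  cat /proc/mtd
  for d in /sys/class/mtd/mtd[0-9]*; do
      echo "== $d"
      for f in name writesize oobsize ecc_strength ecc_step_size \
               corrected_bits ecc_failures bad_blocks; do
          # Not every attribute exists on every kernel; skip the missing ones.
          [ -r "$d/$f" ] && echo "  $f: $(cat "$d/$f")"
      done
  done

Compare that against what U-Boot reports at its prompt (e.g. "nand
info" for the geometry it detected) and against the minimum correction
strength the chip's datasheet demands. An ecc_strength below the
datasheet requirement, or a steadily climbing ecc_failures count,
points straight at the kind of slow corruption you're describing.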
Also, be sure you're not filling up the filesystem, and that you don't
have partitions overlapping and thus overwriting each other. One type
of "bad image" is where the new one is larger than expected, and thus
when you flash it you either overwrite some other area or you truncate
what you are flashing. In cases like these, it is possible for it to
run normally... until it doesn't. But those types of things are easy to
check.

Check that your method of flashing doesn't ignore bad-block markers,
and that when you flash, the ECC data gets written correctly. To give
you an example: I had a system where the typical method of flashing for
production was via U-Boot. I worked with it every day, reflashing via
U-Boot, and I never had a problem. We got random reports of problems
from the field, and even some of my colleagues would see corrupted
(couldn't boot) NAND after a while. Going through the problem, I
discovered that our U-Boot method of flashing worked perfectly and
correctly wrote ECC. I eventually discovered that our user-space update
script would not write ECC but instead left it cleared. And, at least
with the version of UBIFS we were using, UBIFS was tolerant of missing
ECC data. Basically, it would read the page, and if there was ECC data
it would validate it and correct, or error out if there was a problem;
but if there was no ECC data for the page, it would short-circuit and
basically say "nothing for me to check, all is OK".

So, in the short term, anything flashed with the buggy update script
would run fine and would only show problems weeks or so later - and of
course only if a bit flip happened in the wrong place, etc. It was rare
enough that it took a while to notice. But devices that were never
updated (only production-flashed) would be fine. And ones that got
upgraded via the U-Boot method, as I happened to do, never saw the
problems. Hard to track down, and only found because I went through
everything. Hence my comment about "ALL cases".

Hope that helps,
- Steve

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/