Re: Corrupted NAND booting for all devices

Hi JH,

On Tue, Feb 4, 2020 at 11:58 PM JH <jupiter.hce@xxxxxxxxx> wrote:
>
> Hi Steve,
>
> On 2/5/20, Steve deRosier <derosier@xxxxxxxxx> wrote:
> > I've been following your questions on both this list and the
> > linux-wireless one. May I recommend some reading:
> > http://www.linux-mtd.infradead.org/doc/nand.html
> >
> > It isn't clear what filesystem you're using, though I recall from an
> > earlier email you weren't running UBIFS. But in the log I do see UBIFS
> > messages. In any case, based on your descriptions, I strongly suspect
> > NAND bitflips are causing your filesystem corruptions, and you likely
> > don't have the correct settings for the ECC strength as necessary for
> > your NAND. Or maybe you're not flashing images correctly and the ECC
> > info is getting lost.  Or maybe you're writing logs and such to flash
> > and you're filling up the filesystem. Maybe your extents aren't correct
> > and one filesystem overwrites another. Unfortunately, you've got your
> > system so cobbled up with user-space prettiness in your log output
> > that you're obscuring the kernel log output that would help you
> > diagnose the problems.
>
> Yes, the file system is UBIFS. The different revisions of test units
> have been running for many months; they were relatively stable until
> this new revision of hardware. As you found, we have lots of
> low-level problems when running the new revision of hardware. As both
> firmware and hardware have evolved, the first rational thing is to
> narrow down the source of the problem.
>

My suggestion, assuming that you have a version that ran on the old
hardware and that it can run on the new hardware (so, basics like
processor types, etc. haven't changed enough to keep it from working)
is to run the known-good software on the new hardware and see where
you stand. My prediction is it will fail hard, but it should be
informative.

With hardware changes, there are two levels:

1. Small enough changes that it's still more or less the same
platform, but with a few things that need to be changed.
2. Big enough changes that it should be considered a new platform and
treated as such.

Either way, the approach is similar, just different in scope. In the
former, take a look at the old schematic and the new schematic and see
what changed. Pull the datasheets of any chips that changed (both old
and new) and check changed parameters. Check for changed port lines,
GPIO pulls, chip selects, and timing parameters.

In the case of the latter, we're talking the processor changed, the
device architecture changed (not like ARM to MIPS, that's covered in
"processor" changed, more like how your overall device is designed),
etc... Honestly, just start over. Assume everything changed and start
from scratch. Examine everything: make sure your DT matches, as do
your compile flags. Test and confirm everything. Basically, ground-up
board bring-up.

A good question for you: did the old design use SLC NAND and the new
one MLC NAND? That makes a huge difference. Also, I've found that not all
manufacturers are equal in reliability. I had a hardware team that
wanted to substitute a cheaper but "equivalent" part from a different
manufacturer. It had the exact same specs and in theory should have
run seamlessly - yet we had endless corruptions on the first articles
they sent me. We just said "no" and stayed with the more expensive
part because saving $0.45 on something that would sell only a few
tens-of-thousands of units wasn't worth the engineering time to "solve
it in software" if it was even indeed possible to do so. Note that
this was only possible because the only thing that changed was the one
chip, which from a software standpoint should have been identical.
When you only change one thing at a time, it helps you find the
problems.

> I appreciate all your advice, which is very helpful and valid. The
> hardware was designed by other contractors, and there are limited
> tools and equipment for a software guy to debug the hardware. The
> hardware contractors firmly ruled out any issues in hardware; they
> pointed the finger at the software image built from Yocto as the
> cause of the NAND corruption. The

Of course they are saying "it's not my problem". You seem to be living
in a "throw-it-over-the-wall" style organization. In my experience,
you have three choices - get the hell out, become a hardware expert,
or become close friends with one of the guys on the hardware team and
change the culture. You need to realize that, in a way, they're right:
from their limited perspective (i.e. the electrons go where they
should go), everything is fine. But gluing a few chips down on a board is
only 2% of an embedded engineer's job; there's a lot more to do
because our chips are now so programmable, and at the end of the day
it still needs to work right.

Again, examine and understand the datasheets. Check against the
schematic. Satisfy yourself there are no electrical errors. Validate
that every value that gets set in a controller register is the correct
one.
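
To make that concrete, here's the sort of quick sanity check I mean.
This is an untested sketch; "mtd0" and the 8-bits-per-512-bytes figure
are placeholders, so substitute your partition and the correction
strength your NAND's datasheet actually calls for:

#include <stdio.h>

/* Read one integer attribute from sysfs for a given MTD device. */
static long read_mtd_attr(const char *dev, const char *attr)
{
    char path[128];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/class/mtd/%s/%s", dev, attr);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

int main(void)
{
    /* "mtd0" and the 8-per-512 requirement are placeholders; use
     * your partition and the number from your NAND datasheet. */
    long strength = read_mtd_attr("mtd0", "ecc_strength");
    long step = read_mtd_attr("mtd0", "ecc_step_size");

    printf("ECC: %ld bits correctable per %ld-byte step\n", strength, step);
    if (strength >= 0 && strength < 8)
        printf("WARNING: weaker than the 8/512 this chip needs\n");
    return 0;
}

If what the kernel reports doesn't match the datasheet, you've found
your problem right there.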

> Yocto image contains all open source components: the Linux kernel,
> connman, MTD, ofono, etc., so I am trying to figure out whether there
> are limitations and constraints on turning the device power off while
> it may be in the middle of erasing pages. Would that corrupt the NAND
> flash? Or might we not have set things up properly?
>

UBI is designed to be power-cut safe. That's not to say there haven't
been bugs, or that there isn't one now. But basic things like "the
power cut happened while we were erasing a page" shouldn't be a
concern unless there's a driver bug.

> As you said, there are so many things in software and hardware that
> could cause the NAND corruption. What I am particularly interested in
> is whether a so-called bad Yocto image could cause the NAND
> corruption. To be clear, I am not talking about software problems in
> that image; I am talking about a Yocto build system problem which
> generated a bad image. I thought that if you built a bad image, it
> would not be able to run the first time. If an image NAND-boots and
> runs well for several days, what could the Yocto build system have
> done to make that image corrupt the NAND later, like a virus? It does
> not make sense to me, but I could be wrong.

While I would say "nothing is impossible", the scenario you're talking
about is close to it. You're looking down a dry hole. If you can
successfully build and run it, it's not a "bad Yocto image" in the
sense you're describing.

Data loss over time is usually a bit-rot issue in my experience. Check
your ECC parameters, check that ECC data is being written correctly in
ALL cases. Check that U-Boot and the kernel agree on the ECC
parameters. Note: you should be seeing ECC warning messages in your
kernel log on boot, but to my eye you don't have the right settings
there to see what you need to see.
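
One cheap way to watch for bit-rot is to poll the MTD layer's ECC
counters and see whether the corrected count keeps climbing over days
of uptime. Rough, untested sketch, with /dev/mtd0 standing in for your
real partition:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
    struct mtd_ecc_stats st;
    int fd = open("/dev/mtd0", O_RDONLY);   /* your partition here */

    if (fd < 0 || ioctl(fd, ECCGETSTATS, &st) < 0) {
        perror("ECCGETSTATS");
        return 1;
    }
    /* corrected = bitflips ECC fixed so far; failed = uncorrectable
     * reads. A steadily climbing "corrected" is your early warning. */
    printf("corrected=%u failed=%u badblocks=%u bbt=%u\n",
           st.corrected, st.failed, st.badblocks, st.bbtblocks);
    close(fd);
    return 0;
}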

Also, be sure you're not filling up the filesystem, or overlapping
partitions so that they overwrite each other. One type of "bad image"
is one that is larger than expected, so that when you flash it you
either overwrite some other area or truncate what you are flashing. In
cases like these, it is possible for it to run normally...until it
doesn't. But those types of things are easy to check.
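
A pre-flash size check is trivial to automate. Something along these
lines would do it (untested sketch; and remember bad blocks eat
capacity, so an image that only just fits is already too big):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <mtd/mtd-user.h>

int main(int argc, char **argv)
{
    struct mtd_info_user mtd;
    struct stat img;
    int fd;

    if (argc != 3) {
        fprintf(stderr, "usage: %s /dev/mtdX image\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || ioctl(fd, MEMGETINFO, &mtd) < 0) {
        perror(argv[1]);
        return 1;
    }
    if (stat(argv[2], &img) < 0) {
        perror(argv[2]);
        return 1;
    }
    printf("partition %u bytes, image %lld bytes\n",
           mtd.size, (long long)img.st_size);
    /* Bad blocks eat capacity, so "it just barely fits" is already
     * too big on NAND; leave yourself real headroom. */
    if ((unsigned long long)img.st_size > mtd.size)
        fprintf(stderr, "image does not fit -- refusing to flash\n");
    close(fd);
    return 0;
}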

Check that your method of flashing doesn't ignore bad-block markers,
and that when you flash, the ECC data gets written correctly.
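
If you want to see what your flasher should be skipping, MEMGETBADBLOCK
will walk the eraseblocks for you. Untested sketch, again with
/dev/mtd0 as a placeholder:

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <mtd/mtd-user.h>

int main(void)
{
    struct mtd_info_user mtd;
    loff_t off;
    int fd = open("/dev/mtd0", O_RDONLY);   /* your partition here */

    if (fd < 0 || ioctl(fd, MEMGETINFO, &mtd) < 0) {
        perror("/dev/mtd0");
        return 1;
    }
    /* Walk every eraseblock; a sane flasher must skip any block the
     * ioctl reports as bad rather than writing straight through it. */
    for (off = 0; off < mtd.size; off += mtd.erasesize) {
        int ret = ioctl(fd, MEMGETBADBLOCK, &off);
        if (ret > 0)
            printf("bad block at 0x%llx\n", (long long)off);
        else if (ret < 0)
            perror("MEMGETBADBLOCK");
    }
    close(fd);
    return 0;
}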

To give you an example: I had a system where the typical method of
flashing the system for production was via U-Boot. I worked with it
every day, reflashing via U-Boot, and I never had a problem. We got
random reports of problems from the field, and even some of my
colleagues would see corrupted (couldn't boot) NAND after a while.
Going through the problem, I discovered that our U-Boot method of
flashing worked perfectly and correctly wrote ECC. I eventually
discovered that our
user-space update script would not write ECC but instead left it
cleared. And, at least with the version of UBIFS we were using, UBIFS
was tolerant of no ECC data. Basically, it would read the page and if
there was ECC data it would validate it and correct or error out if
there was a problem, but if there was no ECC data for the page, it
would short-circuit and basically say "nothing for me to check, all is
OK". So, in the short term anything flashed with the buggy update
script would run fine and would only show problems weeks or so later.
And of course, only if a bit flip happened in the wrong place, etc....
So it was rare enough it took a while to notice. But devices that were
never updated (only production-flashed) would be fine. And ones that
got upgraded via the U-Boot method, like I happened to do, never saw
the problems. Hard to track down and only found because I went through
everything. Hence my comment of "ALL cases".
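
A check like the following would have flagged that bug immediately:
dump the OOB of a freshly written page and see if it's still erased.
(Untested sketch; the ECC byte placement within the OOB is
controller-specific, so this is only the crude "is the OOB still all
0xFF after a write" test, and /dev/mtd0 plus page offset 0 are
placeholders for the page you actually just wrote.)

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
    struct mtd_info_user mtd;
    struct mtd_oob_buf oob;
    unsigned char *buf;
    unsigned int i, ff = 0;
    int fd = open("/dev/mtd0", O_RDONLY);   /* partition to inspect */

    if (fd < 0 || ioctl(fd, MEMGETINFO, &mtd) < 0) {
        perror("/dev/mtd0");
        return 1;
    }
    buf = malloc(mtd.oobsize);
    if (!buf)
        return 1;
    oob.start = 0;              /* offset of a page you just wrote */
    oob.length = mtd.oobsize;
    oob.ptr = buf;
    if (ioctl(fd, MEMREADOOB, &oob) < 0) {
        perror("MEMREADOOB");
        return 1;
    }
    for (i = 0; i < mtd.oobsize; i++)
        if (buf[i] == 0xff)
            ff++;
    /* A programmed page whose OOB is still all 0xFF was written with
     * no ECC at all -- exactly the bug our update script had. */
    printf("%u of %u OOB bytes are 0xFF\n", ff, mtd.oobsize);
    free(buf);
    close(fd);
    return 0;
}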

Hope that helps,
- Steve

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/


