Re: [RFC PATCH] UBI fixable bit-flip issue

Mark Spieth <mspieth@xxxxxxxxxxxxxxxxx> · Mon, 20 Aug 2018 10:40:14 +1000

On 18/08/18 01:22, Boris Brezillon wrote:
On Fri, 17 Aug 2018 16:53:22 +0200
Boris Brezillon <boris.brezillon@xxxxxxxxxxx> wrote:

On Sat, 18 Aug 2018 00:33:25 +1000
Mark Spieth <mspieth@xxxxxxxxxxxxxxxxx> wrote:

I hope this description is clear enough.
Well, I think selecting the bitflip threshold properly is really
important, simply because some NANDs (including SLC NANDs) are showing
bitflips even on blocks that have a low EC. Check the NAND ECC
requirements, and if it's something like 8bit/512bytes, I guess that's
more or less expected (it all depends on how many bitflips you have in
the faulty block). It's less likely on NANDs requiring 1bit/512bytes
ECC, and if that happens on such NANDs, you may have a problem in the
controller driver.
4 bits ECC per 512 bytes, from memory 28 bytes in OOB, using software
ECC in the MTD driver.
As I said, I believe the better threshold is hiding the root cause. It
is only a band-aid.
What you describe will anyway happen sooner or later: if you're using
almost all LEBs, and the remaining free ones are all impacted by the
correctable bit-flip issue you'll have to use them anyway. So, yes,
this is a band-aid, just like your solution is just improving things
but not really solving the issue. This being said, if the blocks
really show too many bitflips, they should be marked bad at some point,
because during the scrubbing process we do write a pattern and check
that we can read it back. I'll have to double check, but I think we're
also checking for EUCLEAN and marking the block bad when that happens.
Hm, actually we're not torturing the source PEB when moving a LEB
because of bitflips (probably because it's expensive and tends to wear
the block even faster) :-/. The destination PEB is tortured if we fail
to read the VID header back, which is definitely not a guarantee that
other data are readable or do not contain too many bitflips.

There's definitely something to improve there.
Hi Boris,

The flash in use is a Macronix MX30LF1G18AC and uses ONFI mode.

My understanding of the problem is this: when a block is read (say
kernel+initrd) and one of the PEBs reads OK but with corrected bit
errors, scrub mode is enabled.
UBI then finds a suitable PEB and copies the data to it. It verifies
this copy, which also reads back with a corrected bit error, and frees
the PEB it copied from, since that one read OK, only with corrected
errors. Because the new copy showed corrected errors too, it gets
scrubbed in turn: UBI looks for a suitable PEB to copy to, and finds the
original PEB it just moved the data from! The whole copy and readback
verify then repeats, again with corrected errors. This continues forever
(or until a PEB fails to verify, which could take a while). Naturally,
the block read never completes.
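
To make the cycle concrete, here is a toy model of it. This is not UBI
code: the function name, the free-list policy (a freed PEB goes to the
front of the free list and is picked first) and the PEB numbers are all
assumptions chosen to mirror what I saw, not what the wear-leveling code
actually does.

```python
# Toy model only, not UBI code. All names and policies here are hypothetical.
from collections import deque


def simulate_scrub(flaky_pebs, free_pebs, start_peb, max_moves=10):
    """Move one LEB whenever its PEB reads back with corrected bitflips.

    flaky_pebs: PEBs that always read OK, but with corrected bit errors.
    free_pebs:  PEBs available as copy targets. Assumption: a freed PEB is
                put at the front of the free list, so it is picked next.
    Returns the list of PEBs the LEB lived in, in order.
    """
    free = deque(free_pebs)
    current = start_peb
    history = [current]
    for _ in range(max_moves):
        if current not in flaky_pebs:
            break  # clean read, nothing left to scrub
        target = free.popleft()   # pick a destination PEB
        free.appendleft(current)  # source read OK, so it is freed for reuse
        current = target
        history.append(current)
    return history
```

With PEBs 3 and 7 both flaky, simulate_scrub({3, 7}, [7, 12], 3) just
bounces between 3 and 7 until max_moves runs out, and never reaches the
clean PEB 12.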

This is the behaviour I observed in the older driver with lots of print 
debugging. This may not be the behaviour in the current master, but I 
suspect it is.
Some way of detecting this loop within a scrubbing session would be
ideal, but from my examination of the UBI scrubber it looks complex to
do. It shouldn't require a persisted header change, though.
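
As a rough illustration of what I mean by loop detection without any
on-flash change, here is a sketch on a toy model (again not UBI code;
every name, the free-list policy and the return values are hypothetical).
The only state is an in-memory set of PEBs already tried for this LEB in
the current scrubbing session, which is dropped when the session ends.

```python
# Toy model only, not UBI code. All names and policies here are hypothetical.
from collections import deque


def scrub_with_loop_guard(flaky_pebs, free_pebs, start_peb, max_moves=10):
    """Scrub one LEB, refusing to reuse a PEB already tried this session.

    Returns (history, status), where status is "clean", "loop-detected"
    or "gave-up".
    """
    free = deque(free_pebs)
    visited = {start_peb}  # in-memory only, nothing is persisted to flash
    current = start_peb
    history = [current]
    for _ in range(max_moves):
        if current not in flaky_pebs:
            return history, "clean"
        # First free PEB that this session has not tried yet, if any.
        target = next((p for p in free if p not in visited), None)
        if target is None:
            # Every candidate was already tried: stop instead of cycling.
            return history, "loop-detected"
        free.remove(target)
        free.appendleft(current)  # source read OK, so it is freed for reuse
        visited.add(target)
        current = target
        history.append(current)
    return history, "gave-up"
```

With PEBs 3 and 7 flaky and PEB 12 clean, the LEB ends up on 12 instead
of bouncing between 3 and 7. If no untried PEB is left, the session stops
and reports the loop, and the caller can decide what to do (accept the
bitflips, torture the PEB, and so on).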

Regards
Mark

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/