RE: [EXT] Re: [PATCH RFC] mtd: rawnand: Cure MICRON NAND partial erase issue

"Bean Huo (beanhuo)" <beanhuo@xxxxxxxxxx> · Wed, 19 Dec 2018 11:04:52 +0000

Hi all,
Micron has developed a patch related to the issue mentioned here; it is based on v4.2.
I just submitted it here: http://lists.infradead.org/pipermail/linux-mtd/2018-December/086446.html
Please review it. I will update it later based on your comments.

>+Bean,
>
>Hi Thomas,
>
>First of all, I'd like to thank you for sharing this patch. I'm pretty sure this will
>save a lot of people days of painful debugging sessions.
>
>On Thu, 29 Nov 2018 22:12:50 +0100 (CET) Thomas Gleixner
><tglx@xxxxxxxxxxxxx> wrote:
>
>> On some Micron NAND chips block erase fails occasionally despite the
>> chip claiming that it succeeded. The flash block seems to be not
>> completely erased and subsequent usage of the block results in hard to
>> decode and very subtle failures or corruption.
>>
>> The exact reason is unknown, but experimentation has shown that it is
>> only happening when erasing an erase block which is partially written.
>> Partially written erase blocks are not uncommon with UBI/UBIFS.  Note,
>> that this does not always happen. It's a rare and random, but eventually
>> fatal failure.
>>
>> For now, just blindly write 6 pages to 0. Again experimentation has
>> shown that it's not sufficient to write pages at the beginning of the
>> erase block. There need to be pages written in the second half of the
>> erase block as well. So write 3 pages before and past the middle of the block.
>>
>> Less than 6 pages might be sufficient, but it might even be necessary
>> to write more pages to make sure that it's completely cured. Two pages
>> still failed, but the 6 held up in a stress test scenario.
>>
>> This should be optimized by keeping track of writes, but that needs
>> proper information about the issue.
>>
>> As it's just observation and experimentation based, it's probably wise
>> to hold off on this until there is proper clarification about the root
>> cause of the problem. The patch is for reference so others can avoid
>> having to decode this again, but there is no guarantee that it actually fixes
>> the issue completely.
>
>I agree. I Cc-ed Bean from Micron. Maybe he can provide more information
>on this issue.
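
To make the page selection described above concrete ("write 3 pages before and past the middle of the block"), here is a minimal userspace sketch of the arithmetic used in the patch quoted below. The helper name quirk_first_page and the example geometry in the note after it are assumptions for illustration only; they are not part of the patch and are not taken from any data sheet.

```c
#include <assert.h>

#define NAND_ERASE_QUIRK_PAGES 6

/*
 * Hypothetical helper (not part of the patch): return the first page
 * the workaround would touch, given the base page of the erase block
 * and the two shifts the patch reads from struct nand_chip.
 */
static unsigned int quirk_first_page(unsigned int base_page,
				     unsigned int phys_erase_shift,
				     unsigned int page_shift)
{
	unsigned int pages_per_block;

	assert(phys_erase_shift >= page_shift);
	pages_per_block = 1u << (phys_erase_shift - page_shift);

	/* The 6-page window has to fit around the middle of the block. */
	assert(pages_per_block >= NAND_ERASE_QUIRK_PAGES);

	/* Start at (pages_per_block / 2) - 3, as in the patch. */
	return base_page + (pages_per_block >> 1) -
	       (NAND_ERASE_QUIRK_PAGES / 2);
}
```

Assuming a 128 KiB erase block with 2 KiB pages (phys_erase_shift = 17, page_shift = 11), there are 64 pages per block, so the window for the block starting at page 0 would be pages 29 to 34: three pages before the middle and three from the middle onwards.
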
>
>>
>> Therefore:
>>
>> Not-yet-signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>>
>> Cc: Boris Brezillon <boris.brezillon@xxxxxxxxxxx>
>> Cc: Miquel Raynal <miquel.raynal@xxxxxxxxxxx>
>> Cc: Richard Weinberger <richard@xxxxxx>
>>
>> ---
>>
>> P.S.: This was debugged on an older kernel version (sigh) and ported
>>       forward without actual testing on mainline. My MTD foo is a bit
>>       rusty, so I won't be surprised if there are better ways to do that.
>
>Let's first wait for Bean's feedback before discussing implementation details.
>BTW, do you remember the part number(s) of the flash(es) impacted by this
>problem in your case?
>
>Thanks,
>
>Boris
>
>>
>> ---
>>  drivers/mtd/nand/raw/nand_base.c   |   89 +++++++++++++++++++++++++++++++++++++
>>  drivers/mtd/nand/raw/nand_micron.c |    6 ++
>>  include/linux/mtd/rawnand.h        |    3 +
>>  3 files changed, 98 insertions(+)
>>
>> --- a/drivers/mtd/nand/raw/nand_base.c
>> +++ b/drivers/mtd/nand/raw/nand_base.c
>> @@ -4122,6 +4122,91 @@ static int nand_erase(struct mtd_info *m
>>  	return nand_erase_nand(mtd_to_nand(mtd), instr, 0);
>>  }
>>
>> +static bool page_empty(char *buf, int len)
>> +{
>> +	unsigned int *p = (unsigned int *) buf;
>> +	int i;
>> +
>> +	for (i = 0; i < len >> 2; i++, p++) {
>> +		if (*p != UINT_MAX)
>> +			return false;
>> +	}
>> +	return true;
>> +}
>> +
>> +#define NAND_ERASE_QUIRK_PAGES		6
>> +
>> +/**
>> + * nand_erase_quirk - [INTERN] Work around partial erase issues
>> + * @chip:	NAND chip object
>> + * @page:	Eraseblock base page number
>> + *
>> + * On some Micron NAND chips block erase fails occasionally despite the chip
>> + * claiming that it succeeded. The flash block seems to be not completely
>> + * erased and subsequent usage of the block results in hard to decode and
>> + * very subtle failures or corruption.
>> + *
>> + * The exact reason is unknown, but experimentation has shown that it is
>> + * only happening when erasing an erase block which is only partially
>> + * written. Partially written erase blocks are not uncommon with UBI/UBIFS.
>> + * Note, that this does not always happen. It's a rare and random, but
>> + * eventually fatal failure.
>> + *
>> + * For now, just blindly write 6 pages to 0. Again experimentation has
>> + * shown that it's not sufficient to write pages at the beginning of the
>> + * erase block. There need to be pages written in the second half of the
>> + * erase block as well. So write 3 pages before and past the middle of the
>> + * block.
>> + *
>> + * Less than 6 pages might be sufficient, but it might even be necessary to
>> + * write more pages to make sure that it's completely cured. 2 pages still
>> + * failed, but the 6 held up in a stress test scenario.
>> + *
>> + * FIXME: This should be optimized by keeping track of writes, but that
>> + * needs proper information about the issue.
>> + */
>> +static int nand_erase_quirk(struct mtd_info *mtd, int page)
>> +{
>> +	struct nand_chip *chip = mtd->priv;
>> +	unsigned int i, offs;
>> +	u8 *buf;
>> +
>> +	if (!(chip->options & NAND_ERASE_QUIRK))
>> +		return 0;
>> +
>> +	buf = kmalloc(mtd->writesize, GFP_KERNEL);
>> +	if (!buf)
>> +		return -ENOMEM;
>> +
>> +	/* Start at (pages_per_block / 2) - 3 */
>> +	offs = 1 << (chip->phys_erase_shift - chip->page_shift);
>> +	offs = (offs >> 1) - (NAND_ERASE_QUIRK_PAGES / 2);
>> +	page = page + offs;
>> +
>> +	for (i = 0; i < NAND_ERASE_QUIRK_PAGES; i++, page++) {
>> +		struct mtd_oob_ops ops = {
>> +			.datbuf	= buf,
>> +			.len	= mtd->writesize,
>> +		};
>> +
>> +		/*
>> +		 * Read the page back and check whether it is completely
>> +		 * empty.
>> +		 */
>> +		nand_do_read_ops(mtd, page << chip->page_shift, &ops);
>> +		if (page_empty(buf, mtd->writesize))
>> +			continue;
>> +		memset(buf, 0, mtd->writesize);
>> +		/*
>> +		 * Fill page with zeros. Ignore write failure as there
>> +		 * is no way to recover here.
>> +		 */
>> +		nand_do_write_ops(mtd, page << chip->page_shift, &ops);
>> +	}
>> +	kfree(buf);
>> +	return 0;
>> +}
>> +
>>  /**
>>   * nand_erase_nand - [INTERN] erase block(s)
>>   * @chip: NAND chip object
>> @@ -4186,6 +4271,10 @@ int nand_erase_nand(struct nand_chip *ch
>>  		    (page + pages_per_block))
>>  			chip->pagebuf = -1;
>>
>> +		ret = nand_erase_quirk(mtd, page);
>> +		if (ret)
>> +			goto erase_exit;
>> +
>>  		if (chip->legacy.erase)
>>  			status = chip->legacy.erase(chip,
>>  						    page & chip->pagemask);
>> --- a/drivers/mtd/nand/raw/nand_micron.c
>> +++ b/drivers/mtd/nand/raw/nand_micron.c
>> @@ -447,6 +447,12 @@ static int micron_nand_init(struct nand_
>>  	if (ret)
>>  		goto err_free_manuf_data;
>>
>> +	/*
>> +	 * FIXME: Mark all Micron flash with the ERASE QUIRK bit for now as
>> +	 * it is unclear which flash types are affected.
>> +	 */
>> +	chip->options |= NAND_ERASE_QUIRK;
>> +
>>  	if (mtd->writesize == 2048)
>>  		chip->bbt_options |= NAND_BBT_SCAN2NDPAGE;
>>
>> --- a/include/linux/mtd/rawnand.h
>> +++ b/include/linux/mtd/rawnand.h
>> @@ -163,6 +163,9 @@ enum nand_ecc_algo {
>>  /* Device needs 3rd row address cycle */
>>  #define NAND_ROW_ADDR_3		0x00004000
>>
>> +/* Device requires erase quirk */
>> +#define NAND_ERASE_QUIRK	0x00008000
>> +
>>  /* Options valid for Samsung large page devices */
>>  #define NAND_SAMSUNG_LP_OPTIONS NAND_CACHEPRG
>>
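
The page_empty() helper in the quoted patch compares 32-bit words, so it relies on the buffer being suitably aligned and skips any trailing bytes when len is not a multiple of four. A byte-wise check avoids both assumptions, probably at some cost in speed. The following is a userspace sketch only: the name page_is_erased is hypothetical, and the check does not tolerate bitflips in an erased page, which a real driver may have to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical byte-wise variant of the patch's page_empty(): a page
 * counts as erased only if every byte reads back as 0xFF.
 */
static bool page_is_erased(const unsigned char *buf, size_t len)
{
	size_t i;

	/* A NULL buffer is only acceptable for a zero-length check. */
	assert(buf != NULL || len == 0);

	for (i = 0; i < len; i++) {
		if (buf[i] != 0xFF)
			return false;
	}
	return true;
}
```

With this variant a single programmed byte anywhere in the page, including the last one or two bytes of an oddly sized buffer, makes the page count as written.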

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/