Re: [PATCH v1 11/11] mtd: new support oops logger based on pstore/blk

liaoweixiong <liaoweixiong@xxxxxxxxxxxxxxxxx> · Fri, 7 Feb 2020 18:30:37 +0800

hi Miquel Raynal,

On 2020/2/7 下午4:41, Miquel Raynal wrote:
Hi Liao,

liaoweixiong <liaoweixiong@xxxxxxxxxxxxxxxxx> wrote on Fri, 7 Feb 2020
12:13:08 +0800:

hi Miquel Raynal,

On 2020/2/6 PM 11:45, Miquel Raynal wrote:
Hi liao,

liaoweixiong <liaoweixiong@xxxxxxxxxxxxxxxxx> wrote on Thu, 6 Feb 2020
21:10:47 +0800:

hi Miquel Raynal,

On 2020/1/23 AM 1:41, Miquel Raynal wrote:
Hello,

    >>>>>>>> +/*
+ * All zones will be read as pstore/blk will read zone one by one when do
+ * recover.
+ */
+static ssize_t mtdpstore_read(char *buf, size_t size, loff_t off)
+{
+	struct mtdpstore_context *cxt = &oops_cxt;
+	size_t retlen;
+	int ret;
+
+	if (mtdpstore_block_isbad(cxt, off))
+		return -ENEXT;
+
+	pr_debug("try to read off 0x%llx size %zu\n", off, size);
+	ret = mtd_read(cxt->mtd, off, size, &retlen, (u_char *)buf);
+	if ((ret < 0 && !mtd_is_bitflip(ret)) || size != retlen)  {

IIRC size != retlen does not mean it failed, but that you should
continue reading after retlen bytes, no?
      >>
Yes, you are right. I will fix it. Thanks.
    >>>>> Also, mtd_is_bitflip() does not mean that you are reading a false
buffer, but that the data has been corrected as it contained bitflips.
mtd_is_eccerr() however, would be meaningful.
      >>
Sure I know mtd_is_bitflip() does not mean failure, but I do not think
mtd_is_eccerr() should be here since the codes are ret < 0 and NOT
mtd_is_bitflip().

Yes, just drop this check, only keep ret < 0.
     >>
If I don't get it wrong, it should not	 be dropped here. Like your words,
"mtd_is_bitflip() does not mean that you are reading a false buffer,
but that the data has been corrected as it contained bitflips.", the
data I get are valid even if mtd_is_bitflip() return true. It's correct
data and it's no need to go to handle error. To me, the codes
should be:
	if (ret < 0 && !mit_is_bitflip())
		[error handling]

Please check the implementation of mtd_is_bitflip(). You'll probably
figure out what I am saying.

https://elixir.bootlin.com/linux/latest/source/include/linux/mtd/mtd.h#L585
    >>
How about the codes as follows:

for (done = 0, retlen = 0; done < size; done += retlen) {
	ret = mtd_read(..., &retlen, ...);
	if (!ret)
		continue;
	/*
	 * do nothing if bitflip and ecc error occurs because whether
	 * it's bitflip or ECC error, just a small number of bits flip
	 * and the impact on log data is so small. The mtdpstore just
	 * hands over what it gets and user can judge whether the data
	 * is valid or not.
	 */
	if (mtd_is_bitflip(ret)) {
		dev_warn("bitflip at....");
		continue;

I don't understand why do you check for bitflips. Bitflips have been
corrected at this stage, you just get the information that there
has been bitflips, but the data integrity is fine.

Both of bitflip and eccerror are not real wrong in this
case. So we must check them.

I am not against ignoring ECC errors in this case though. I would
propose:

	for (...) {
		if (ret < 0) {
			complain;
			return;
		}

-117 (-EUCLEAN) means bitflip but be corrected.
-74 (-EBADMSG) means ecc error that uncorrectable
All of them are negative number that smaller than 0. If it just keeps
"ret < 0", it can never make a difference between bitflip/eccerror
and others.

I forgot about these, your're right, so what about:

	static bool mtdpstore_is_io_error(int ret)
	{
		return ret < 0 && !mtd_is_eccerr(ret)> && !mtd_is_bitflip(ret);
	}

	for (...) {
		if (mtdpstore_is_io_error(ret))
			return ret;

		/*
		 * Comment explaining why we still return valid data
		 * despite ECC errors.
		 */
		if (mtd_is_eccerr(ret))
			just-complain();
	}

This snippet makes a difference between any "controller/bus
timeout/error : data could not be retrieved" and "ECC error, maybe we
can still read it and try to understand (risky, must be warned)".

That seems good to me. I will fix it later.
Thanks for your review.

		if (mtd_is_eccerr())
			complain;
	}

	} else if (mtd_is_eccerr(ret)) {
		dev_warn("eccerr at....");
		retlen = retlen == 0 ? size : retlen;
		continue;
	} else {
		dev_err("read failure at...");
		/* this zone is broken, try next one */
		return -ENEXT;
	}
}

Thanks,
Miquèl

Thanks,
Miquèl