On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 26/10/23 10:56, Kornel Dulęba wrote:
> > On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >>
> >> On 25/10/23 11:01, Kornel Dulęba wrote:
> >>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >>>>
> >>>> On 20/10/23 11:53, Kornel Dulęba wrote:
> >>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
> >>>>>>
> >>>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
> >>>>>>> There are several reasons for the controller to generate an error
> >>>>>>> interrupt. They include a controller<->card timeout and a CRC
> >>>>>>> mismatch error.
> >>>>>>> Right now we only get one line in the logs stating that CQE
> >>>>>>> recovery was triggered, but with no information about what
> >>>>>>> caused it.
> >>>>>>> To figure out what happened, be more verbose and dump the
> >>>>>>> registers from the error IRQ handler logic.
> >>>>>>> This matches the behaviour of the software timeout logic, see
> >>>>>>> cqhci_timeout.
> >>>>>>>
> >>>>>>> Signed-off-by: Kornel Dulęba <korneld@xxxxxxxxxxxx>
> >>>>>>> ---
> >>>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
> >>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> >>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
> >>>>>>> --- a/drivers/mmc/host/cqhci-core.c
> >>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
> >>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
> >>>>>>>
> >>>>>>>  	terri = cqhci_readl(cq_host, CQHCI_TERRI);
> >>>>>>>
> >>>>>>> -	pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
> >>>>>>> -		 mmc_hostname(mmc), status, cmd_error, data_error, terri);
> >>>>>>> +	pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
> >>>>>>> +		mmc_hostname(mmc), status, cmd_error, data_error);
> >>>>>>> +	cqhci_dumpregs(cq_host);
> >>>>>>
> >>>>>> For debugging, isn't dynamic debug more appropriate?
> >>>>>
> >>>>> Dynamic debug is an option, but my personal preference would be to
> >>>>> just log more info in the error handler.
> >>>>
> >>>> Interrupt handlers can get called very rapidly, so some kind of rate
> >>>> limiting should be used if the message is unconditional. Also you need
> >>>> to provide actual reasons for your preference.
> >>>>
> >>>> For dynamic debug of the register dump, something like below is
> >>>> possible:
> >>>>
> >>>> #define cqhci_dynamic_dumpregs(cqhost) \
> >>>> 	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
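(For completeness, in case we end up revisiting the dynamic debug route:
with the macro above the error handler would just call

	cqhci_dynamic_dumpregs(cq_host);

and, if I'm reading include/linux/dynamic_debug.h correctly, a call site
created via _dynamic_func_call_no_desc() is matched by its format
string, so it could presumably be enabled at runtime with something like

	echo 'format cqhci_dynamic_dumpregs +p' > /sys/kernel/debug/dynamic_debug/control

Untested, so the exact query may need tweaking.)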
> >>>>
> >>> Fair point.
> >>> The reason I'm not a fan of using dynamic debug for this is that my
> >>> goal here is to improve the warning/error logging information that we
> >>> get from systems running in production.
> >>> I.e. if we get a lot of "running CQE recovery" messages, at the very
> >>> least I'd like to know what is causing them.
> >>
> >> So you are saying you want to collect debug information from production
> >> systems, but don't want to use dynamic debug to do it?
> >>
> >>>>> To give you some background.
> >>>>> We're seeing some "running CQE recovery" lines in the logs, followed
> >>>>> by a dm-verity mismatch error.
> >>>>> The reports come from the field, with no feasible way to reproduce
> >>>>> the issue locally.
> >>>>
> >>>> If it is a software error, some kind of error injection may well
> >>>> reproduce it. Also, if it is a hardware error that only happens
> >>>> during recovery, error injection could increase the likelihood of
> >>>> reproducing it.
> >>>
> >>> We tried software injection and it didn't yield any results.
> >>> We're currently looking into "injecting" hw errors by using a small
> >>> burst field generator to interfere with transfers on the data line
> >>> directly.
> >>
> >> I just tried instrumenting a driver to inject CRC errors and it
> >> revealed several CQE recovery issues, including a spurious TCN for
> >> tag 31. I will send some patches when they are ready.
> >
> > Sorry, what I meant by "it didn't yield results" is that it didn't
> > trigger the dm-verity error that we're seeing in production.
> > With SW injection there are two potential issues that come to mind:
> >
> > 1. In cqhci_error_irq, when TERRI is not valid, only a single
> > "random" task is marked as bad.
> > Then in cqhci_recover_mrq we're marking all pending requests as done.
> > For data transfers this is somewhat benign, as it will return with
> > bytes_xfered=0.
> > IIUC this will then cause the upper layer to re-enqueue this request.
>
> Yes
>
> > The bigger problem is a CMD-only mrq, which will be mistakenly marked
> > as completed successfully.
>
> I noticed that also. Notably the only non-data CMD is cache flush.
>
> There are several other issues, but patches will describe
> them better.

Sure. :)

jfyi, I've just managed to inject CRC errors by using a burst field
generator. (Langer P23 if you're interested.)
I'm using it by touching the D0 MMC line directly, and it yields
surprisingly good results.
I've changed the spurious TCN WARN_ONCE to pr_warn and got the
following:

[   71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x972e0000
[   71.885730] mmc1: running CQE recovery
[   71.888135] cqhci_recovery_finish: TCN: 0x00000000
[   71.888141] mmc1: cqhci: recovery done
[   71.888223] mmc1: cqhci: spurious TCN for tag 23
(...)
[   95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x822e0000
[   95.558768] mmc1: running CQE recovery
[   95.561073] cqhci_recovery_finish: TCN: 0x00000000
[   95.561078] mmc1: cqhci: recovery done
[   95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted

Now I get a spurious TCN after every recovery, with the only exception
being the one that ends up with the dm-verity error.
So it would seem that there's a race in which the "spurious" TCN hits a
pending request that was enqueued right after recovery completed.
I'm currently looking into how to fix it, but if you beat me to it I
can also test your patches and see if they fix the dm-verity issue.

> >
> > 2. As for the spurious task completion warning.
> > I initially thought that it was benign.
> > The check for !mrq is done before checking if we're currently doing
> > recovery.
> > So if it's called right at the end of recovery, right after
> > cqhci_recover_mrqs is executed, that would explain it.
> > With that being said, if that irq handler is run right after the
> > recovery is finished, we'll end up with a race where a new request,
> > that was just enqueued, might be mistakenly marked as done.
> > This would explain the dm-verity errors we're seeing.
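To make my point 2 above concrete: in cqhci_finish_mrq() the !mrq check
(and the "spurious TCN" warning) currently comes before the
recovery_halt check. Below is an untested sketch of the reordering I
have in mind. It only silences the warning during recovery; it does not
close the window where a TCN lands after recovery_halt has already been
cleared, which is what the trace above suggests is happening, so treat
it as an illustration of the race rather than a fix:

	/*
	 * Sketch: look at recovery_halt before complaining about a
	 * missing mrq. cqhci_recover_mrqs() has already flushed the
	 * slots at this point, so a late TCN for an empty slot is
	 * expected here rather than "spurious".
	 */
	if (cq_host->recovery_halt) {
		/* Keep the existing behaviour for slots that still hold an mrq */
		if (mrq)
			slot->flags |= CQHCI_COMPLETED;
		return;
	}

	if (!mrq) {
		WARN_ONCE(1, "%s: cqhci: spurious TCN for tag %d\n",
			  mmc_hostname(mmc), tag);
		return;
	}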
> >
> >>
> >>>>
> >>>>>
> >>>>> I'd argue that logging only the info that CQE recovery was executed
> >>>>> is not particularly helpful for someone looking into those logs.
> >>>>
> >>>> As the comment says, that message is there because recovery reduces
> >>>> performance; it is not to aid debugging per se.
> >>>>
> >>>>> Ideally we would have more data about the state the controller was
> >>>>> in when the error happened, or at least what caused the recovery to
> >>>>> be triggered.
> >>>>> The question here is how verbose we should be in this error
> >>>>> scenario.
> >>>>> Looking at other error scenarios, in the case of a software timeout
> >>>>> we're dumping the controller registers. (cqhci_timeout)
> >>>>
> >>>> Timeout means something is broken - either the driver, the cq engine
> >>>> or the card. On the other hand, an error interrupt is most likely a
> >>>> CRC error, which is not unexpected occasionally, due to thermal drift
> >>>> or perhaps interference.
> >>>
> >>> Right, but my point is that we don't know what triggered CQE recovery.
> >>
> >> True, although probably a CRC error.
> >>
> >>>
> >>>>
> >>>>> Hence I thought it'd be appropriate to match that and do the same
> >>>>> in the CQE recovery logic.
> >>>>
> >>>> It needs to be consistent. There are other pr_debugs, such as:
> >>>>
> >>>> 	pr_debug("%s: cqhci: Failed to clear tasks\n",
> >>>> 	pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
> >>>> 	pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> >>>>
> >>>> which should perhaps be treated the same.
> >>>>
> >>>> And there are no messages for errors from the commands in
> >>>> mmc_cqe_recovery().
> >>>
> >>> How about this.
> >>> As a compromise, would it be okay to just do a single pr_warn
> >>> directly from cqhci_error_irq?
> >>
> >> Sure, printk_ratelimited() or __ratelimit()
> >>
> >>> We could simply promote the existing pr_debug to pr_warn at the
> >>> beginning of that function.
> >>> This would tell us what triggered the recovery. (controller timeout,
> >>> CRC mismatch)
> >>> We can also consider removing the "running CQE recovery" print for
> >>> the sake of brevity.
> >>
> >> No, that serves a different purpose.
> >>
> >>> The only downside of this that I can see is that we'd be running the
> >>> logic from the interrupt handler directly, but I can't see an easy
> >>> way around that.
> >>> What do you think?
> >>
> >> Should be OK with rate limiting.
> >
> > OK, I'll look into the rate limiting and will send a v2.
> >
> >>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>  	/* Forget about errors when recovery has already been triggered */
> >>>>>>>  	if (cq_host->recovery_halt)
> >>>>>>
> >>>>
> >>
> >
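P.S. For the v2 discussed above, I'm leaning towards the simplest
variant: keep the existing message (TERRI included) and let the stock
printk helpers do the rate limiting. A rough, untested sketch, using
the pr_warn_ratelimited() wrapper from printk.h:

	terri = cqhci_readl(cq_host, CQHCI_TERRI);

	pr_warn_ratelimited("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
			    mmc_hostname(mmc), status, cmd_error, data_error, terri);

The cmd/data error codes already distinguish a CRC error (-EILSEQ, the
-84 in the logs above) from a timeout (-ETIMEDOUT), so this answers the
"what triggered recovery" question without dumping all the registers.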