On 26/10/23 11:29, Kornel Dulęba wrote:
> On Thu, Oct 26, 2023 at 10:14 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>
>> On 26/10/23 10:56, Kornel Dulęba wrote:
>>> On Thu, Oct 26, 2023 at 8:25 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>>>
>>>> On 25/10/23 11:01, Kornel Dulęba wrote:
>>>>> On Mon, Oct 23, 2023 at 1:38 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On 20/10/23 11:53, Kornel Dulęba wrote:
>>>>>>> On Fri, Oct 20, 2023 at 9:41 AM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 16/10/23 12:56, Kornel Dulęba wrote:
>>>>>>>>> There are several reasons for the controller to generate an error interrupt.
>>>>>>>>> They include controller<->card timeouts and CRC mismatch errors.
>>>>>>>>> Right now we only get one line in the logs stating that CQE recovery was
>>>>>>>>> triggered, but with no information about what caused it.
>>>>>>>>> To figure out what happened, be more verbose and dump the registers from
>>>>>>>>> the irq error handler logic.
>>>>>>>>> This matches the behaviour of the software timeout logic, see
>>>>>>>>> cqhci_timeout.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Kornel Dulęba <korneld@xxxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>>  drivers/mmc/host/cqhci-core.c | 5 +++--
>>>>>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
>>>>>>>>> index b3d7d6d8d654..33abb4bd53b5 100644
>>>>>>>>> --- a/drivers/mmc/host/cqhci-core.c
>>>>>>>>> +++ b/drivers/mmc/host/cqhci-core.c
>>>>>>>>> @@ -700,8 +700,9 @@ static void cqhci_error_irq(struct mmc_host *mmc, u32 status, int cmd_error,
>>>>>>>>>
>>>>>>>>>  	terri = cqhci_readl(cq_host, CQHCI_TERRI);
>>>>>>>>>
>>>>>>>>> -	pr_debug("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
>>>>>>>>> -		 mmc_hostname(mmc), status, cmd_error, data_error, terri);
>>>>>>>>> +	pr_warn("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d\n",
>>>>>>>>> +		mmc_hostname(mmc), status, cmd_error, data_error);
>>>>>>>>> +	cqhci_dumpregs(cq_host);
>>>>>>>>
>>>>>>>> For debugging, isn't dynamic debug more appropriate?
>>>>>>>
>>>>>>> Dynamic debug is an option, but my personal preference would be to
>>>>>>> just log more info in the error handler.
>>>>>>
>>>>>> Interrupt handlers can get called very rapidly, so some kind of rate
>>>>>> limiting should be used if the message is unconditional. Also you need
>>>>>> to provide actual reasons for your preference.
>>>>>>
>>>>>> For dynamic debug of the register dump, something like below is
>>>>>> possible.
>>>>>>
>>>>>> #define cqhci_dynamic_dumpregs(cqhost) \
>>>>>> 	_dynamic_func_call_no_desc("cqhci_dynamic_dumpregs", cqhci_dumpregs, cqhost)
>>>>>>
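(For reference, a wrapper like that becomes switchable at runtime once
CONFIG_DYNAMIC_DEBUG is enabled; assuming it lands in cqhci-core.c,
something like:

  echo 'file cqhci-core.c +p' > /sys/kernel/debug/dynamic_debug/control

would turn on every dynamic debug site in that file, register dump
included.)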
>>>>> Fair point.
>>>>> The reason I'm not a fan of using dynamic debug for this is that my
>>>>> goal here is to improve the warning/error logging information that we
>>>>> get from systems running in production.
>>>>> I.e. if we get a lot of "running CQE recovery" messages, at the very
>>>>> least I'd like to know what is causing them.
>>>>
>>>> So you are saying you want to collect debug information from production
>>>> systems, but don't want to use dynamic debug to do it?
>>>>
>>>>>>> To give you some background.
>>>>>>> We're seeing some "running CQE recovery" lines in the logs, followed
>>>>>>> by a dm_verity mismatch error.
>>>>>>> The reports come from the field, with no feasible way to reproduce the
>>>>>>> issue locally.
>>>>>>
>>>>>> If it is a software error, some kind of error injection may well
>>>>>> reproduce it. Also if it is a hardware error that only happens
>>>>>> during recovery, error injection could increase the likelihood of
>>>>>> reproducing it.
>>>>>
>>>>> We tried software injection and it didn't yield any results.
>>>>> We're currently looking into "injecting" hw errors by using a small
>>>>> burst field generator to interfere with transfers on the data line
>>>>> directly.
>>>>
>>>> I just tried instrumenting a driver to inject CRC errors and it
>>>> revealed several CQE recovery issues, including a spurious TCN for
>>>> tag 31. I will send some patches when they are ready.
>>>
>>> Sorry, what I meant by "it didn't yield results" is that it didn't
>>> trigger the dm-verity error that we're seeing in production.
>>> With SW injection there are two potential issues that come to my mind:
>>>
>>> 1. In cqhci_error_irq, when TERRI is not valid only a single,
>>> "random" task is marked as bad.
>>> Then in cqhci_recover_mrq we're marking all pending requests as done.
>>> For data transfers this is somewhat benign, as it will return with
>>> bytes_xfered=0.
>>> IIUC this will then cause the upper layer to re-enqueue the request.
>>
>> Yes
>>
>>> The bigger problem is a CMD-only mrq, which will be mistakenly marked
>>> as completed successfully.
>>
>> I noticed that also. Notably the only non-data CMD is cache flush.
>>
>> There are several other issues, but the patches will describe
>> them better.
>
> Sure. :)
>
> jfyi, I've just managed to inject CRC errors by using a burst field generator.
> (Langer P23 if you're interested.)
> I'm using it by touching the D0 MMC line directly, and it yields
> surprisingly good results.
> I've changed the spurious TCN WARN_ONCE to pr_warn and got the following:
>
> [ 71.885698] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x972e0000
> [ 71.885730] mmc1: running CQE recovery
> [ 71.888135] cqhci_recovery_finish: TCN: 0x00000000
> [ 71.888141] mmc1: cqhci: recovery done
> [ 71.888223] mmc1: cqhci: spurious TCN for tag 23
> (...)
> [ 95.558736] mmc1: cqhci: error IRQ status: 0x00000000 cmd error 0 data error -84 TERRI: 0x822e0000
> [ 95.558768] mmc1: running CQE recovery
> [ 95.561073] cqhci_recovery_finish: TCN: 0x00000000
> [ 95.561078] mmc1: cqhci: recovery done
> [ 95.561288] device-mapper: verity: 179:3: data block 712181 is corrupted
>
> Now I get a spurious TCN after every recovery, with the only exception
> being the one that ends up with the dm-verity error.
> So it'd seem that there's a race in which the "spurious" TCN hits a
> pending request, enqueued right after recovery was completed.
> I'm currently looking into how to fix it, but if you beat me to it I
> can also test your patches and see if they fix the dm-verity issue.
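As an aside, CRC errors can also be faked purely in software with
something like the sketch below (illustrative only: cqhci_inject_crc_error()
is a made-up helper, and this is not the actual instrumentation referred
to above):

static bool cqhci_inject_crc_error(void)
{
	static unsigned int count;

	/* Report a fake data CRC error on roughly one IRQ in 128 */
	return ++count % 128 == 0;
}

	/* At the top of cqhci_irq(), before the error checks: */
	if (cqhci_inject_crc_error())
		data_error = -EILSEQ;	/* -84, matching the logs above */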
OK, but here are some hacks to try:

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 3a8f27c3e310..5eb98da933d5 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2379,6 +2379,9 @@ enum mmc_issued mmc_blk_mq_issue_rq(struct mmc_queue *mq, struct request *req)
 			blk_mq_end_request(req, BLK_STS_OK);
 			return MMC_REQ_FINISHED;
 		}
+		ret = mmc_blk_wait_for_idle(mq, host);
+		if (ret)
+			return MMC_REQ_BUSY;
 		ret = mmc_blk_cqe_issue_flush(mq, req);
 		break;
 	case REQ_OP_READ:
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 3d3e0ca52614..8ac1efca2be6 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -563,6 +563,16 @@ int mmc_cqe_recovery(struct mmc_host *host)
 
 	host->cqe_ops->cqe_recovery_finish(host);
 
+	if (err) {
+		memset(&cmd, 0, sizeof(cmd));
+		cmd.opcode       = MMC_CMDQ_TASK_MGMT;
+		cmd.arg          = 1; /* Discard entire queue */
+		cmd.flags        = MMC_RSP_R1B | MMC_CMD_AC;
+		cmd.flags       &= ~MMC_RSP_CRC; /* Ignore CRC */
+		cmd.busy_timeout = MMC_CQE_RECOVERY_TIMEOUT;
+		err = mmc_wait_for_cmd(host, &cmd, 3);
+	}
+
 	mmc_retune_release(host);
 
 	return err;
diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
index b3d7d6d8d654..93e1bbdc446f 100644
--- a/drivers/mmc/host/cqhci-core.c
+++ b/drivers/mmc/host/cqhci-core.c
@@ -975,8 +975,13 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
 
 	ret = cqhci_halted(cq_host);
 
-	if (!ret)
-		pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
+	if (!ret) {
+		u32 cqcfg;
+
+		cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
+		cqcfg &= ~CQHCI_ENABLE;
+		cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
+	}
 
 	return ret;
 }
@@ -987,7 +992,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
  * layers will need to send a STOP command), so we set the timeout based on a
  * generous command timeout.
  */
-#define CQHCI_START_HALT_TIMEOUT	5
+#define CQHCI_START_HALT_TIMEOUT	200
 
 static void cqhci_recovery_start(struct mmc_host *mmc)
 {
@@ -1039,6 +1044,8 @@ static void cqhci_recover_mrq(struct cqhci_host *cq_host, unsigned int tag)
 		data->error = cqhci_error_from_flags(slot->flags);
 	} else {
 		mrq->cmd->error = cqhci_error_from_flags(slot->flags);
+		if (!mrq->cmd->error)
+			mrq->cmd->error = -EIO;
 	}
 
 	mmc_cqe_request_done(cq_host->mmc, mrq);
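(Roughly what these do: the block.c hunk waits for the queue to go idle
before issuing a flush; the core.c hunk retries the "discard entire queue"
task management command when recovery has failed so far; and the
cqhci-core.c hunks disable the engine if halting fails, give the initial
halt far longer to complete, and make a recovered CMD-only mrq - in
practice a cache flush - report -EIO instead of silently succeeding.)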
>
>>
>>>
>>> 2. As for the spurious task completion warning.
>>> I initially thought that it was benign.
>>> The check for !mrq is done before checking if we're currently doing recovery.
>>> So if it's called right at the end of recovery, right after
>>> cqhci_recover_mrqs is executed, that would explain it.
>>> With that being said, if that irq handler runs right after the
>>> recovery is finished, we'll end up with a race where a new request,
>>> one that was just enqueued, might be mistakenly marked as done.
>>> This would explain the dm-verity errors we're seeing.
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>> I'd argue that logging only the info that CQE recovery was executed is
>>>>>>> not particularly helpful for someone looking into those logs.
>>>>>>
>>>>>> As the comment says, that message is there because recovery reduces
>>>>>> performance; it is not to aid debugging per se.
>>>>>>
>>>>>>> Ideally we would have more data about the state the controller was in
>>>>>>> when the error happened, or at least what caused the recovery to be
>>>>>>> triggered.
>>>>>>> The question here is how verbose we should be in this error scenario.
>>>>>>> Looking at other error scenarios, in the case of a software timeout
>>>>>>> we're dumping the controller registers. (cqhci_timeout)
>>>>>>
>>>>>> Timeout means something is broken - either the driver, the cq engine
>>>>>> or the card. On the other hand, an error interrupt is most likely a
>>>>>> CRC error, which is not unexpected occasionally, due to thermal drift
>>>>>> or perhaps interference.
>>>>>
>>>>> Right, but my point is that we don't know what triggered CQE recovery.
>>>>
>>>> True, although probably a CRC error.
>>>>
>>>>>
>>>>>>
>>>>>>> Hence I thought it'd be appropriate to match that and do the same
>>>>>>> in the CQE recovery logic.
>>>>>>
>>>>>> It needs to be consistent. There are other pr_debugs, such as:
>>>>>>
>>>>>> 	pr_debug("%s: cqhci: Failed to clear tasks\n",
>>>>>> 	pr_debug("%s: cqhci: Failed to halt\n", mmc_hostname(mmc));
>>>>>> 	pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
>>>>>>
>>>>>> which should perhaps be treated the same.
>>>>>>
>>>>>> And there are no messages for errors from the commands in
>>>>>> mmc_cqe_recovery().
>>>>>
>>>>> How about this.
>>>>> As a compromise, would it be okay to just do a single pr_warn directly
>>>>> from cqhci_error_irq?
>>>>
>>>> Sure, printk_ratelimited() or __ratelimit()
>>>>
>>>>> We could simply promote the existing pr_debug to pr_warn at the
>>>>> beginning of that function.
>>>>> This would tell us what triggered the recovery. (controller timeout,
>>>>> CRC mismatch)
>>>>> We can also consider removing the "running CQE recovery" print for the
>>>>> sake of brevity.
>>>>
>>>> No, that serves a different purpose.
>>>>
>>>>> The only downside of this that I can see is that we'd be running the
>>>>> logic from the interrupt handler directly, but I can't see an easy way
>>>>> around that.
>>>>> What do you think?
>>>>
>>>> Should be OK with rate limiting.
>>>
>>> OK, I'll look into the rate limiting and will send a v2.
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  	/* Forget about errors when recovery has already been triggered */
>>>>>>>>>  	if (cq_host->recovery_halt)
>>>>>>>>
>>>>>>
>>>>
>>
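For the v2, the compromise discussed above could look something like the
sketch below (untested; pr_warn_ratelimited() is one way to get the rate
limiting, an explicit __ratelimit() state being the other):

	terri = cqhci_readl(cq_host, CQHCI_TERRI);

	/* Promote the existing pr_debug() so production logs show what
	 * triggered recovery, rate-limited since the IRQ can fire rapidly.
	 */
	pr_warn_ratelimited("%s: cqhci: error IRQ status: 0x%08x cmd error %d data error %d TERRI: 0x%08x\n",
			    mmc_hostname(mmc), status, cmd_error, data_error, terri);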