On Mon, Oct 30, 2023 at 8:31 PM Adrian Hunter <adrian.hunter@xxxxxxxxx> wrote:
>
> On 27/10/23 17:56, Kornel Dulęba wrote:
> > This fix addresses a stale task completion event issued right after
> > CQE recovery. As it's a hardware issue, the fix is done in the form
> > of a quirk.
> >
> > When an error interrupt is received, the driver runs its recovery
> > logic. It halts the controller, clears all pending tasks, and then
> > re-enables it. On some platforms a stale task completion event is
> > observed, regardless of the CQHCI_CLEAR_ALL_TASKS bit being set.
> >
> > This results in either:
> > a) A spurious TC completion event for an empty slot.
> > b) Corrupted data being passed up the stack, as a result of premature
> >    completion for a newly added task.
> >
> > To fix that, re-enable the controller, clear the task completion bits
> > and the interrupt status register, and then halt it again.
> > This is done at the end of the recovery process, right before
> > interrupts are re-enabled.
> >
> > Signed-off-by: Kornel Dulęba <korneld@xxxxxxxxxxxx>
> > ---
> >  drivers/mmc/host/cqhci-core.c | 42 +++++++++++++++++++++++++++++++++++
> >  drivers/mmc/host/cqhci.h      |  1 +
> >  2 files changed, 43 insertions(+)
> >
> > diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> > index b3d7d6d8d654..e534222df90c 100644
> > --- a/drivers/mmc/host/cqhci-core.c
> > +++ b/drivers/mmc/host/cqhci-core.c
> > @@ -1062,6 +1062,45 @@ static void cqhci_recover_mrqs(struct cqhci_host *cq_host)
> >  /* CQHCI could be expected to clear it's internal state pretty quickly */
> >  #define CQHCI_CLEAR_TIMEOUT 20
> >
> > +/*
> > + * During CQE recovery all pending tasks are cleared from the
> > + * controller and its state is being reset.
> > + * On some platforms the controller sets a task completion bit for
> > + * a stale(previously cleared) task right after being re-enabled.
> > + * This results in a spurious interrupt at best and corrupted data
> > + * being passed up the stack at worst. The latter happens when
> > + * the driver enqueues a new request on the problematic task slot
> > + * before the "spurious" task completion interrupt is handled.
> > + * To fix it:
> > + * 1. Re-enable controller by clearing the halt flag.
> > + * 2. Clear interrupt status and the task completion register.
> > + * 3. Halt the controller again to be consistent with quirkless logic.
> > + *
> > + * This assumes that there are no pending requests on the queue.
> > + */
> > +static void cqhci_quirk_clear_stale_tc(struct cqhci_host *cq_host)
> > +{
> > +        u32 reg;
> > +
> > +        WARN_ON(cq_host->qcnt);
> > +        cqhci_writel(cq_host, 0, CQHCI_CTL);
> > +        if ((cqhci_readl(cq_host, CQHCI_CTL) & CQHCI_HALT)) {
> > +                pr_err("%s: cqhci: CQE failed to exit halt state\n",
> > +                       mmc_hostname(cq_host->mmc));
> > +        }
> > +        reg = cqhci_readl(cq_host, CQHCI_TCN);
> > +        cqhci_writel(cq_host, reg, CQHCI_TCN);
> > +        reg = cqhci_readl(cq_host, CQHCI_IS);
> > +        cqhci_writel(cq_host, reg, CQHCI_IS);
> > +
> > +        /*
> > +         * Halt the controller again.
> > +         * This is only needed so that we're consistent across quirk
> > +         * and quirkless logic.
> > +         */
> > +        cqhci_halt(cq_host->mmc, CQHCI_FINISH_HALT_TIMEOUT);
> > +}
>
> Thanks a lot for tracking this down!
>
> It could be that the "un-halt" starts a task, so it would be
> better to force the "clear" to work if possible, which
> should be the case if CQE is disabled.
>
> Would you mind trying the code below?  Note the increased
> CQHCI_START_HALT_TIMEOUT helps avoid trying to clear tasks
> when CQE has not halted.
Sure, I'll try it out tomorrow, as I don't have access to the DUT today.

BTW, do we even need to halt the controller in the recovery_finish
logic? It has already been halted in recovery_start; I guess the second
halt could be there in case the recovery_start halt didn't work. But in
that case, shouldn't we do this disable/re-enable dance in
recovery_start?

>
>
> diff --git a/drivers/mmc/host/cqhci-core.c b/drivers/mmc/host/cqhci-core.c
> index b3d7d6d8d654..534c13069833 100644
> --- a/drivers/mmc/host/cqhci-core.c
> +++ b/drivers/mmc/host/cqhci-core.c
> @@ -987,7 +987,7 @@ static bool cqhci_halt(struct mmc_host *mmc, unsigned int timeout)
>   * layers will need to send a STOP command), so we set the timeout based on a
>   * generous command timeout.
>   */
> -#define CQHCI_START_HALT_TIMEOUT 5
> +#define CQHCI_START_HALT_TIMEOUT 500
>
>  static void cqhci_recovery_start(struct mmc_host *mmc)
>  {
> @@ -1075,28 +1075,27 @@ static void cqhci_recovery_finish(struct mmc_host *mmc)
>
>          ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
>
> -        if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> -                ok = false;
> -
>          /*
>           * The specification contradicts itself, by saying that tasks cannot be
>           * cleared if CQHCI does not halt, but if CQHCI does not halt, it should
>           * be disabled/re-enabled, but not to disable before clearing tasks.
>           * Have a go anyway.
>           */
> -        if (!ok) {
> -                pr_debug("%s: cqhci: disable / re-enable\n", mmc_hostname(mmc));
> -                cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> -                cqcfg &= ~CQHCI_ENABLE;
> -                cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> -                cqcfg |= CQHCI_ENABLE;
> -                cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> -                /* Be sure that there are no tasks */
> -                ok = cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> -                if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> -                        ok = false;
> -                WARN_ON(!ok);
> -        }
> +        if (!cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT))
> +                ok = false;
> +
> +        cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> +        cqcfg &= ~CQHCI_ENABLE;
> +        cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> +        cqcfg = cqhci_readl(cq_host, CQHCI_CFG);
> +        cqcfg |= CQHCI_ENABLE;
> +        cqhci_writel(cq_host, cqcfg, CQHCI_CFG);
> +
> +        cqhci_halt(mmc, CQHCI_FINISH_HALT_TIMEOUT);
> +
> +        if (!ok)
> +                cqhci_clear_all_tasks(mmc, CQHCI_CLEAR_TIMEOUT);
>
>          cqhci_recover_mrqs(cq_host);
>
>
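For context on how such a quirk would be consumed, below is a minimal
sketch of a platform host driver opting into the proposed workaround at
probe time. It is only an illustration: the flag added by the cqhci.h
hunk is not quoted above, so CQHCI_QUIRK_CLEAR_STALE_TC, its value, and
the example_* function name are placeholders rather than the patch's
actual definitions.

#include <linux/err.h>
#include <linux/mmc/host.h>
#include <linux/platform_device.h>

#include "cqhci.h"

/* Placeholder for the bit the (unquoted) cqhci.h hunk would define. */
#define CQHCI_QUIRK_CLEAR_STALE_TC 0x2

static int example_sdhci_cqe_add_host(struct mmc_host *mmc,
                                      struct platform_device *pdev)
{
        struct cqhci_host *cq_host;

        cq_host = cqhci_pltfm_init(pdev);
        if (IS_ERR(cq_host))
                return PTR_ERR(cq_host);

        /* Opt this platform into the stale task-completion workaround. */
        cq_host->quirks |= CQHCI_QUIRK_CLEAR_STALE_TC;

        /* cq_host->ops and caps setup omitted; 32-bit DMA descriptors assumed. */
        return cqhci_init(cq_host, mmc, false);
}

Whether a per-platform opt-in like this survives at all is exactly what
the thread above is debating, since Adrian's counter-proposal makes the
disable/re-enable sequence unconditional in cqhci_recovery_finish().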