Re: MMC card detection may trip watchdog if card powered off

Ulf Hansson <ulf.hansson@xxxxxxxxxx> · Thu, 21 Nov 2024 11:58:41 +0100

On Wed, 20 Nov 2024 at 17:21, Anthony Pighin (Nokia)
<anthony.pighin@xxxxxxxxx> wrote:
>
> If card detection is done via polling, due to broken-cd (Freescale LX2160, etc.), or for other reasons, then the card will be polled asynchronously and periodically.
>
> If that polling happens after the card has been put into the powered-off state (i.e. during system shutdown/reboot), then the polling times out. That timeout is long (10s), and it is repeated three times. All of this happens after watchdogd has been disabled, meaning that the system watchdogs are no longer being kicked.
>
> If the MMC polling exceeds the watchdog trip time, then the system will be ungracefully reset. Or, in the case of a pretimeout-capable watchdog, the pretimeout will trip unnecessarily.
>
>     [   46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg 03220301 flags 0000049d
>     [   46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001
>     [   46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002
>     [   46.891906] mmc_request_done:187: mmc1: req done (CMD6): 0: 00000800 00000000 00000000 00000000
>     [   46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0 cs 0 Vdd 0 width 1 timing 0
>     [   46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg 00010000 flags 00000195
>     [   57.433351] mmc1: Timeout waiting for hardware cmd interrupt.

Hmm. How is the polling being done? By using MMC_CAP_NEEDS_POLL?

The above certainly looks weird. The mmc_rescan work should not be
allowed to run at this point, as the work is getting punted to the
system_freezable_wq via mmc_schedule_delayed_work().

>     ...
>     [   71.031911] [Redacted] 2030000.i2c:[Redacted]@41:watchdog: Watchdog interrupt received!
>     [   71.039737] Kernel panic - not syncing: watchdog pretimeout event
>     [   71.045820] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O       6.6.59 #1
>     [   71.053207] Hardware name: [Redacted]
>     [   71.059897] Call trace:
>     [   71.062332]  dump_backtrace+0x9c/0x128
>     ...
>
> CMD6 is SWITCH and arg 03220301 is a write to POWER_OFF_NOTIFICATION (bits 23:16 = 0x22 = 34).
> CMD13 is SEND_STATUS, and when it occurs after the POWER_OFF_NOTIFICATION (as above) bad things happen.
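
For reference, the CMD6 argument in the log above can be split into its
fields as follows. This is only a minimal userspace sketch, assuming the
usual eMMC SWITCH argument layout (access mode in bits 25:24, EXT_CSD
index in bits 23:16, value in bits 15:8, command set in bits 2:0). The
struct and function names are made up for illustration and are not
kernel APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: the fields of an eMMC CMD6 (SWITCH) argument. */
struct switch_arg {
	uint8_t access;   /* bits 25:24, 3 = write byte */
	uint8_t index;    /* bits 23:16, EXT_CSD byte index */
	uint8_t value;    /* bits 15:8, value written to that byte */
	uint8_t cmd_set;  /* bits 2:0, command set */
};

/* Split a raw 32-bit CMD6 argument into its fields. */
static struct switch_arg decode_switch_arg(uint32_t arg)
{
	struct switch_arg f;

	f.access  = (arg >> 24) & 0x3;
	f.index   = (arg >> 16) & 0xff;
	f.value   = (arg >> 8) & 0xff;
	f.cmd_set = arg & 0x7;
	return f;
}

/* Print the decoded fields, e.g. for the arg seen in a debug log. */
static void print_switch_arg(uint32_t arg)
{
	struct switch_arg f = decode_switch_arg(arg);

	printf("access=%u index=%u value=%u cmd_set=%u\n",
	       (unsigned)f.access, (unsigned)f.index,
	       (unsigned)f.value, (unsigned)f.cmd_set);
}
```

For 0x03220301 this gives access 3 (write byte), index 34
(POWER_OFF_NOTIFICATION), value 3 and command set 1, which is consistent
with the reading of the log given above.
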
>
> I have made the following change to attempt to work around the issue, and it seems to hold up, but is also quite brute force:
>
> --- a/drivers/mmc/core/mmc.c
> +++ b/drivers/mmc/core/mmc.c
> @@ -2046,6 +2046,11 @@ static void mmc_remove(struct mmc_host *host)
>   */
>  static int mmc_alive(struct mmc_host *host)
>  {
> +       if (host->card && mmc_card_suspended(host->card)) {
> +               pr_err("%s: Skip card detection: Card suspended\n",
> +                      mmc_hostname(host));
> +               return -ENOMEDIUM;
> +       }
>         return mmc_send_status(host->card, NULL);
>  }

Yeah, the above isn't really the correct solution in my opinion.

We need to prevent the mmc_rescan work from running, which I thought
we already did...

>
> Anthony
>
>

Kind regards
Uffe