RE: MMC card detection may trip watchdog if card powered off

"Anthony Pighin (Nokia)" <anthony.pighin@xxxxxxxxx> · Mon, 25 Nov 2024 18:07:03 +0000

> > > > > >
> > > > > > If card detection is done via polling, due to broken-cd
> > > > > > (Freescale LX2160,
> > > > > etc.), or for other reasons, then the card will be polled
> > > > > asynchronously and periodically.
> > > > > >
> > > > > > If that polling happens after the card has been put in powered
> > > > > > off state
> > > (i.e.
> > > > > during system shutdown/reboot), then the polling times out. That
> > > > > timeout is of a long duration (10s). And it is repeated multiple
> > > > > times (x3). And that is all done after the watchdogd has been
> > > > > disabled, meaning that system watchdogs are not being kicked.
> > > > > >
> > > > > > If the MMC polling exceeds the watchdog trip time, then the
> > > > > > system will be
> > > > > ungraciously reset. Or in the case of a pretimeout capable
> > > > > watchdog, the pretimeout will trip unnecessarily.
> > > > > >
> > > > > >     [   46.872767] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg
> > > > > 03220301 flags 0000049d
> > > > > >     [   46.880258] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000001
> > > > > >     [   46.886082] sdhci_irq:3558: mmc1: sdhci: IRQ status 0x00000002
> > > > > >     [   46.891906] mmc_request_done:187: mmc1: req done (CMD6):
> 0:
> > > > > 00000800 00000000 00000000 00000000
> > > > > >     [   46.900606] mmc_set_ios:892: mmc1: clock 0Hz busmode 2
> > > > > powermode 0 cs 0 Vdd 0 width 1 timing 0
> > > > > >     [   46.914934] mmc_mrq_pr_debug:274: mmc1: starting CMD13
> arg
> > > > > 00010000 flags 00000195
> > > > > >     [   57.433351] mmc1: Timeout waiting for hardware cmd interrupt.
> > > > >
> > > > > Hmm. How is the polling being done? By using
> MMC_CAP_NEEDS_POLL?
> > > > >
> > > >
> > > > Correct. (At least in my understanding.) 'broken-cd' in the device
> > > > tree will trigger MMC_CAP_NEEDS_POLL to be set.
> > > >
> > > > > The above certainly looks weird. The mmc_rescan work should not
> > > > > be allowed to run at this point, as the work is getting punted
> > > > > to the system_freezable_wq via mmc_schedule_delayed_work().
> > > > >
> > > >
> > > > This is the backtrace I get when the timeout occurs:
> > > >
> > > > [   46.154348] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg
> > > 00010000 flags 00000195
> > > > [   46.161917] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001
> > > > [   46.167743] mmc_request_done:187: mmc1: req done (CMD13): 0:
> > > 00000900 00000000 00000000 00000000
> > > > [   46.176526] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg
> > > 03220301 flags 0000049d
> > > > [   46.184016] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001
> > > > [   46.189840] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002
> > > > [   46.195665] mmc_request_done:187: mmc1: req done (CMD6): 0:
> > > 00000800 00000000 00000000 00000000
> > > > [   46.204362] mmc_set_ios:892: mmc1: clock 0Hz busmode 2
> powermode 0
> > > cs 0 Vdd 0 width 1 timing 0
> > > > [   46.219307] CPU: 6 PID: 153 Comm: kworker/6:1 Tainted: G           O
> > > 6.6.59 #1
> > > > [   46.231988] Hardware name: [Redacted]
> > > > [   46.238678] Workqueue: events_freezable mmc_rescan
> > > > [   46.243466] Call trace:
> > > > [   46.245901]  dump_backtrace+0x9c/0x128
> > > > [   46.249643]  show_stack+0x20/0x38
> > > > [   46.252948]  dump_stack_lvl+0x48/0x60
> > > > [   46.256603]  dump_stack+0x18/0x28
> > > > [   46.259909]  mmc_alive+0x74/0x88
> > > > [   46.263128]  _mmc_detect_card_removed+0x3c/0x158
> > > > [   46.267735]  mmc_detect+0x30/0x98
> > > > [   46.271040]  mmc_rescan+0x94/0x238
> > > > [   46.274432]  process_one_work+0x178/0x3d8
> > > > [   46.278432]  worker_thread+0x2bc/0x3e0
> > > > [   46.282171]  kthread+0xf8/0x110
> > > > [   46.285303]  ret_from_fork+0x10/0x20
> > > > [   46.288878] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg
> > > 00010000 flags 00000195
> > > > [   56.793379] mmc1: Timeout waiting for hardware cmd interrupt.
> > > > [   56.799116] mmc1: sdhci: ============ SDHCI REGISTER DUMP
> > > ===========
> > > > [   56.805545] mmc1: sdhci: Sys addr:  0x00000000 | Version:
> 0x00002202
> > > > ...
> > > >
> > >
> > > Okay. If this is system suspend, it looks like the workqueue didn't
> > > become frozen as exepcted. I don't know why, but this needs to be
> investigated.
> > >
> > > If this is a shutdown, mmc_host_classdev_shutdown() should be called
> > > to "disable" the mmc_rescan work from running the code causing the
> above.
> > >
> > > What am I missing?
> > >
> > > Kind regards
> > > Uffe
> > >
> >
> > This occurs in response to a 'reboot' command.
> >
> > Here is an instance of the failure, with initcall_debug set to 'Y':
> >
> > Rebooting...
> > [   44.500858] caam_jr 8030000.jr: shutdown
> > [   44.504805] caam_jr 8020000.jr: shutdown
> > [   44.508748] caam_jr 8010000.jr: shutdown
> > [   44.515349] tpm tpm0: shutdown_pre
> > [   44.523165] igb 0000:01:00.0: shutdown
> > [   44.558695] usb usb2-port1: shutdown
> > [   44.562268] usb usb1-port1: shutdown
> > [   44.565864] xhci-hcd xhci-hcd.0.auto: shutdown
> > [   44.570306] dwc3 3100000.usb: shutdown
> > [   44.574052] [Redacted] 2030000.i2c:cpld@41:watchdog: shutdown
> > [   44.580137] tpm_tis_spi spi2.0: shutdown
> > [   44.584054] spi-nor spi0.1: shutdown
> > [   44.588066] spi-nor spi0.0: shutdown
> > [   44.592112] mmcblk mmc0:0001: shutdown
> > [   44.595902] mmc_mrq_pr_debug:274: mmc0: starting CMD7 arg
> 00000000 flags 00000000
> > [   44.603413] sdhci_irq:3546: mmc0: sdhci: IRQ status 0x00000001
> > [   44.609240] mmc_request_done:187: mmc0: req done (CMD7): 0:
> 00000000 00000000 00000000 00000000
> > [   44.617939] mmc_set_ios:892: mmc0: clock 0Hz busmode 2 powermode 0
> cs 0 Vdd 0 width 1 timing 0
> > [   44.627877] mmcblk mmc1:0001: shutdown
> > [   44.631709] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg
> 03200101 flags 00000095
> > [   44.639200] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001
> > [   44.645027] mmc_request_done:187: mmc1: req done (CMD6): 0:
> 00000800 00000000 00000000 00000000
> > [   44.653727] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg
> 00010000 flags 00000195
> > [   44.661296] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001
> > [   44.667123] mmc_request_done:187: mmc1: req done (CMD13): 0:
> 00000900 00000000 00000000 00000000
> > [   44.675907] mmc_mrq_pr_debug:274: mmc1: starting CMD6 arg
> 03220301 flags 0000049d
> > [   44.683399] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000001
> > [   44.689223] sdhci_irq:3546: mmc1: sdhci: IRQ status 0x00000002
> > [   44.695047] mmc_request_done:187: mmc1: req done (CMD6): 0:
> 00000800 00000000 00000000 00000000
> > [   44.703746] mmc_set_ios:892: mmc1: clock 0Hz busmode 2 powermode 0
> cs 0 Vdd 0 width 1 timing 0
> > [   44.713667] aer 0000:00:00.0:pcie002: shutdown
> > [   44.713673] mmc_mrq_pr_debug:274: mmc1: starting CMD13 arg
> 00010000 flags 00000195
> > [   44.718102] pcie_pme 0000:00:00.0:pcie001: shutdown
> > [   44.730534] pcieport 0000:00:00.0: shutdown
> > [   44.734898] fsl_mc_dprc dprc.1: shutdown
> > [   44.740863] fsl_mc_dprc dprc.1: DPRC device shutdown
> > [   44.745830] mmc_host mmc1: shutdown_pre
> > [   54.873286] mmc1: Timeout waiting for hardware cmd interrupt.
> > [   54.879023] mmc1: sdhci: ============ SDHCI REGISTER DUMP
> ===========
> > [   54.885453] mmc1: sdhci: Sys addr:  0x00000000 | Version:  0x00002202
> >
> > The trigger CMD13 is in-flight at time 44.713673.
> > The mmc_host mmc1: shutdown_pre is at time 44.745830.
>
> Thanks a lot for providing the details! Yes, this certainly points out the
> problem during shutdown/reboot.
>
> I have sent a patch [1] that I think should address the problem. Can you
> please test and verify it?
>
> [...]
>
> Kind regards
> Uffe
>
> [1]
> https://lore.ke/
> rnel.org%2Fall%2F20241125122446.18684-1-
> ulf.hansson%40linaro.org%2F&data=05%7C02%7Canthony.pighin%40nokia.co
> m%7C2b25985f3d974df1d15e08dd0d4cf275%7C5d4717519675428d917b70f4
> 4f9630b0%7C0%7C0%7C638681346376209654%7CUnknown%7CTWFpbGZsb
> 3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkF
> OIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=XQ%2B3SRi8%2B
> F9NZLEuhmekipj5YrjxVhTnlXIP2RUokmg%3D&reserved=0

I confirm that the patch corrects the behaviour.

Thank you!

Anthony