Re: mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.

Adrian Hunter <adrian.hunter@xxxxxxxxx> · Tue, 6 Aug 2024 10:31:22 +0300

On 6/08/24 00:33, Gratian Crisan wrote:
> Hi all,
> 
> We are getting the following splat on latest 6.11.0-rc2-00002-gc813111d19e6 (and
> older) kernel(s):

Do you know a kernel version that does not get an error?

> 
> [    4.792991] mmc0: new ultra high speed DDR50 SDHC card at address 0001
> [    4.793550]   with environment:
> [    4.793786]     HOME=/
> [    4.793985]     TERM=linux
> [    4.794201]     BOOT_IMAGE=/runmode/bzImage
> [    4.794485]     sys_reset=false
> [    4.795791] mmcblk0: mmc0:0001 0016G 15.2 GiB
> [    5.333153] mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.
> [    5.333676] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> [    5.334069] mmc0: sdhci: Sys addr:  0x12454200 | Version:  0x0000b502
> [    5.334464] mmc0: sdhci: Blk size:  0x00007040 | Blk cnt:  0x00000001
> [    5.334860] mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000010
> [    5.335253] mmc0: sdhci: Present:   0x01ff0000 | Host ctl: 0x00000016
> [    5.335648] mmc0: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
> [    5.336040] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000107
> [    5.336432] mmc0: sdhci: Timeout:   0x0000000a | Int stat: 0x00000000
> [    5.336824] mmc0: sdhci: Int enab:  0x03ff008b | Sig enab: 0x03ff008b
> [    5.337214] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
> [    5.337605] mmc0: sdhci: Caps:      0x076864b2 | Caps_1:   0x00000004
> [    5.337997] mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
> [    5.338389] mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
> [    5.338780] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> [    5.339170] mmc0: sdhci: Host ctl2: 0x0000000c
> [    5.339468] mmc0: sdhci: ADMA Err:  0x00000003 | ADMA Ptr: 0x12454200
> [    5.339859] mmc0: sdhci: ============================================
> [    5.340293] I/O error, dev mmcblk0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> [    5.344663] Buffer I/O error on dev mmcblk0, logical block 0, async page read
> [    5.346127]  mmcblk0: p1 p2
> 
> This is on an Intel Bay Trail based system: NI cRIO-9053 using an Atom E3805.
> 
> The issue appears related to the one fixed by commit b3855668d98c ("mmc: sdhci:
> Add support for "Tuning Error" interrupts") and discussed here[1].

Does reverting that commit help?

> 
> After adding some debug prints it appears that in our case we get a tuning error
> interrupt during a MMC_SEND_STATUS (13) sdhci cmd which has no 'host->data'
> associated with it (leading to the splat):
> 
> [    4.893298] mmc0: new ultra high speed DDR50 SDHC card at address 0001
> [    4.896489] mmcblk0: mmc0:0001 0016G 15.2 GiB
> [    4.906048] mmc0: tuning err irq, sdhci cmd: 18, host->cmd: 0000000003b39249, host->data: 00000000c0b4ad8a
> [    4.963027] mmc0: tuning err irq, sdhci cmd: 18, host->cmd: 0000000003b39249, host->data: 00000000c0b4ad8a
> [    5.384960] mmc0: tuning err irq, sdhci cmd: 17, host->cmd: 0000000003b39249, host->data: 00000000c0b4ad8a
> [    5.442877] mmc0: tuning err irq, sdhci cmd: 13, host->cmd: 00000000e1669bad, host->data: 0000000000000000
> [    5.443463] mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.
> 
> I am new to this area of the kernel so I would appreciate any suggestions on the
> direction to take here:
> 
>   - Should the tuning error interrupts be handled in common code in sdhci_irq()
>     (or at least before the !host->data check in sdhci_data_irq())?
> 
>   - Is this more of an issue with tuning not happening when it is expected or
>     taking too long, since at first we do get the error during data transfer
>     commands? Suggestions on what I should debug/trace next appreciated.

The SDHCI driver does not enable the "Tuning Error" interrupt; refer to
the kernel messages above:

	Int enab:  0x03ff008b | Sig enab: 0x03ff008b

but the interrupt fires regardless, so the "fix" was to handle it anyway.

But that raises the question: wasn't the error happening already?

> 
> Thanks,
>     Gratian
> 
> [1] https://lore.kernel.org/r/20240410191639.526324-3-hdegoede@xxxxxxxxxx