Re: mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.

Gratian Crisan <gratian.crisan@xxxxxx> · Tue, 06 Aug 2024 16:35:11 -0500

Adrian Hunter <adrian.hunter@xxxxxxxxx> writes:
> On 6/08/24 00:33, Gratian Crisan wrote:
>> 
>> We are getting the following splat on latest 6.11.0-rc2-00002-gc813111d19e6 (and
>> older) kernel(s):
>
> Do you know a kernel version that does not get an error?
>

Sorry for not being clearer in my original email: this is not a new issue. I
believe this Bay Trail hardware has always had an issue with receiving "Tuning
Error" interrupts with certain SD cards, at least as far back as 4.9.47.

Up until commit b3855668d98c ("mmc: sdhci: Add support for "Tuning Error"
interrupts") these resulted in a "mmc0: Unexpected interrupt 0x04000000" splat,
which b3855668d98c fixed.

However, now that "Tuning Error" interrupts are treated as data interrupts and
handled in sdhci_data_irq(), we are hitting a corner case where the tuning error
interrupt comes in after an MMC_SEND_STATUS command, which has no 'host->data'
associated with it, resulting in the new splat.
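
To make the corner case concrete, here is a small user-space model of the
ordering problem. It is only a sketch and not the driver code: the struct and
function names are made up for illustration, and the second function shows just
one possible reading of "handle the tuning error before the !host->data check".

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 26 of the interrupt status, i.e. the 0x04000000 seen in the splat. */
#define INT_TUNING_ERROR 0x04000000u

/* Hypothetical stand-ins for the driver's host and data structures. */
struct model_data { int error; };
struct model_host { struct model_data *data; };

enum model_result {
	RES_SPURIOUS,      /* "Got data interrupt ... no data operation" path */
	RES_TUNING_ERROR,  /* tuning error recognised and handled */
	RES_OTHER          /* any other data interrupt (not modelled) */
};

/* Models the current ordering: the !host->data check comes first. */
static enum model_result model_data_irq(struct model_host *host,
					uint32_t intmask)
{
	if (!host->data)
		return RES_SPURIOUS;

	if (intmask & INT_TUNING_ERROR) {
		host->data->error = -EIO;
		return RES_TUNING_ERROR;
	}
	return RES_OTHER;
}

/* Hypothetical alternative: look at the tuning error bit first. */
static enum model_result model_data_irq_tuning_first(struct model_host *host,
						     uint32_t intmask)
{
	if (intmask & INT_TUNING_ERROR) {
		/* Only flag the transfer if one is actually in flight. */
		if (host->data)
			host->data->error = -EIO;
		return RES_TUNING_ERROR;
	}
	if (!host->data)
		return RES_SPURIOUS;
	return RES_OTHER;
}
```

With host->data == NULL (as after MMC_SEND_STATUS), the first function ends up
on the spurious path and the second one does not. Whether the second ordering
is actually the right fix is exactly what I am asking about.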

Hence the question in my previous email: Should the tuning error interrupts be
handled in common code in sdhci_irq()?

>> 
>> [    4.792991] mmc0: new ultra high speed DDR50 SDHC card at address 0001
>> [    4.793550]   with environment:
>> [    4.793786]     HOME=/
>> [    4.793985]     TERM=linux
>> [    4.794201]     BOOT_IMAGE=/runmode/bzImage
>> [    4.794485]     sys_reset=false
>> [    4.795791] mmcblk0: mmc0:0001 0016G 15.2 GiB
>> [    5.333153] mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.
>> [    5.333676] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>> [    5.334069] mmc0: sdhci: Sys addr:  0x12454200 | Version:  0x0000b502
>> [    5.334464] mmc0: sdhci: Blk size:  0x00007040 | Blk cnt:  0x00000001
>> [    5.334860] mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000010
>> [    5.335253] mmc0: sdhci: Present:   0x01ff0000 | Host ctl: 0x00000016
>> [    5.335648] mmc0: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
>> [    5.336040] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000107
>> [    5.336432] mmc0: sdhci: Timeout:   0x0000000a | Int stat: 0x00000000
>> [    5.336824] mmc0: sdhci: Int enab:  0x03ff008b | Sig enab: 0x03ff008b
>> [    5.337214] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>> [    5.337605] mmc0: sdhci: Caps:      0x076864b2 | Caps_1:   0x00000004
>> [    5.337997] mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
>> [    5.338389] mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
>> [    5.338780] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>> [    5.339170] mmc0: sdhci: Host ctl2: 0x0000000c
>> [    5.339468] mmc0: sdhci: ADMA Err:  0x00000003 | ADMA Ptr: 0x12454200
>> [    5.339859] mmc0: sdhci: ============================================
>> [    5.340293] I/O error, dev mmcblk0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>> [    5.344663] Buffer I/O error on dev mmcblk0, logical block 0, async page read
>> [    5.346127]  mmcblk0: p1 p2
>> 
>> This is on an Intel Bay Trail based system: NI cRIO-9053 using an Atom E3805.
>> 
>> The issue appears related to the one fixed by commit b3855668d98c ("mmc: sdhci:
>> Add support for "Tuning Error" interrupts") and discussed here[1].
>
> Does reverting that commit help?
>

Reverting the commit brings back the original splat that commit fixed (albeit
without the I/O error):

[    4.893032] mmc0: new ultra high speed DDR50 SDHC card at address 0001
[    4.896238] mmcblk0: mmc0:0001 0016G 15.2 GiB
[    4.905944] mmc0: Unexpected interrupt 0x04000000.
[    4.906272] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[    4.906664] mmc0: sdhci: Sys addr:  0x126e6200 | Version:  0x0000b502
[    4.907059] mmc0: sdhci: Blk size:  0x00007200 | Blk cnt:  0x00000008
[    4.907451] mmc0: sdhci: Argument:  0x00000000 | Trn mode: 0x0000003b
[    4.907842] mmc0: sdhci: Present:   0x01ff0206 | Host ctl: 0x00000016
[    4.908234] mmc0: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
[    4.908625] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x00000107
[    4.909015] mmc0: sdhci: Timeout:   0x0000000a | Int stat: 0x00000002
[    4.909408] mmc0: sdhci: Int enab:  0x03ff008b | Sig enab: 0x03ff008b
[    4.909800] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000001
[    4.910193] mmc0: sdhci: Caps:      0x076864b2 | Caps_1:   0x00000004
[    4.910581] mmc0: sdhci: Cmd:       0x0000123a | Max curr: 0x00000000
[    4.910976] mmc0: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00400900
[    4.911371] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00400900
[    4.911765] mmc0: sdhci: Host ctl2: 0x0000000c
[    4.912064] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0x126e6200
[    4.912456] mmc0: sdhci: ============================================
[    4.913301]  mmcblk0: p1 p2
[    6.401855] EXT4-fs (mmcblk1p2): mounted filesystem d57a3d3c-a1f9-4f8e-8cbc-19dc5bb4fc4c r/w with ordered data mode. Quota mode: disabled.

>> I am new to this area of the kernel so I would appreciate any suggestions on the
>> direction to take here:
>> 
>>   - Should the tuning error interrupts be handled in common code in sdhci_irq()
>>     (or at least before the !host->data check in sdhci_data_irq())?
>> 
>>   - Is this more of an issue with tuning not happening when is expected or
>>     taking too long, since at first we do get the error during data transfer
>>     commands? Suggestions on what I should debug/trace next appreciated.
>
> SDHCI driver does not enable the "Tuning Error" interrupt, refer
> the kernel messages above:
>
> 	Int enab:  0x03ff008b | Sig enab: 0x03ff008b
>
> but it happens anyway, so the "fix" was to handle it anyway.
>
> But it begs the question, wasn't the error happening already?
>

Kind of: before, we were getting "mmc0: Unexpected interrupt 0x04000000", but
somehow it didn't result in an I/O error. That may have been just lucky timing.

Now we're getting "mmc0: Got data interrupt 0x04000000 even though no data operation was in
progress." followed by an I/O error on READ.
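
As a sanity check on the point about the enable masks, the arithmetic can be
spelled out as below. This is only an illustrative sketch; the helper names are
mine and are not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt bit reported in both splats. */
#define INT_TUNING_ERROR 0x04000000u

/* "Int enab" / "Sig enab" value from the register dumps. */
#define DUMPED_INT_ENABLE 0x03ff008bu

/* True if the given interrupt bit is set in the enable mask. */
static bool irq_bit_enabled(uint32_t enable_mask, uint32_t bit)
{
	return (enable_mask & bit) != 0;
}

/* Position of the lowest set bit, or -1 if no bit is set. */
static int bit_index(uint32_t value)
{
	int n = 0;

	if (value == 0)
		return -1;
	while ((value & 1u) == 0) {
		value >>= 1;
		n++;
	}
	return n;
}
```

Here 0x04000000 is bit 26, and 0x03ff008b only covers bits 0, 1, 3, 7 and
16-25, so irq_bit_enabled(DUMPED_INT_ENABLE, INT_TUNING_ERROR) is false. That
matches the observation that the interrupt is not enabled and yet still fires.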

I appreciate your reply. I'm happy to work on a patch or test things if I'm
pointed in the right direction.

Thanks,
    Gratian