mmc0: Timeout waiting for hardware cmd interrupt on i.MX535

"Baumgartner, Claus (GE Healthcare)" <claus.baumgartner@xxxxxxxxxx> · Tue, 1 Sep 2020 07:37:31 +0000

Hi,

We have a board with an i.MX535 that uses a Samsung eMMC, connected to eSDHCv3, as persistent storage. Every now and then we produce a build that causes eMMC timeouts:

Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd interrupt.
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:  0xe3f12000 | Version:  0x00001201
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============================================
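
For reference, here is a small sketch (Python, not part of the kernel or of our software) that decodes the Cmd and Argument values from the dump. It assumes the standard SDHCI Command register layout (command index in bits 13:8, response type in bits 1:0, CRC/index check enables in bits 3 and 4, data present in bit 5). The function name decode_sdhci_cmd is made up for illustration, and reading the upper 16 bits of the argument as an RCA only makes sense for addressed commands such as CMD13.

```python
def decode_sdhci_cmd(cmd_reg, argument):
    """Split an SDHCI Command register value into its fields.

    Assumes the standard SDHCI layout; the 'rca' entry is only
    meaningful for addressed commands such as CMD13 (SEND_STATUS).
    """
    return {
        "index": (cmd_reg >> 8) & 0x3F,        # command index, bits 13:8
        "resp_type": cmd_reg & 0x3,            # 2 = 48-bit response
        "crc_check": bool(cmd_reg & 0x08),     # command CRC check enable
        "index_check": bool(cmd_reg & 0x10),   # command index check enable
        "data_present": bool(cmd_reg & 0x20),  # data present select
        "rca": argument >> 16,                 # upper 16 bits of the argument
    }


# Values taken from the register dump above.
fields = decode_sdhci_cmd(0x00000D1A, 0x00010000)
```

With the values from the dump this gives command index 13 with a 48-bit response, no data transfer, and RCA 1, so the command that timed out appears to be a CMD13 (SEND_STATUS) to the card.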

Timeouts do not occur with every build. After some debugging I have found that the timeouts seem to depend on the code alignment of the esdhc_writel_le function. I have bisected the behavior by using the System.map and moving/padding the code with NOP instructions (mov r0,r0).
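
To make the alignment check concrete, the sketch below (Python, illustrative only) looks up a symbol in System.map text and reports its address together with its offset inside an alignment window. The helper name symbol_alignment and the sample addresses are made up, and the 64-byte default window is an assumption (chosen to match what I understand to be the Cortex-A8 cache line size), not something I have confirmed to be the relevant boundary.

```python
def symbol_alignment(system_map_text, symbol, line_size=64):
    """Return (address, address % line_size) for a symbol, or None.

    Expects System.map lines of the form "<hex address> <type> <name>".
    line_size is an assumed alignment window, not a verified value.
    """
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == symbol:
            addr = int(parts[0], 16)
            return addr, addr % line_size
    return None


# Made-up sample entries, only to show the expected input format.
sample_map = (
    "c04a1b20 t esdhc_readl_le\n"
    "c04a1b4c t esdhc_writel_le\n"
)
result = symbol_alignment(sample_map, "esdhc_writel_le")
```

Running this over the System.map of a "good" and a "bad" build and comparing the offsets is one way to see whether the failing builds share an alignment.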

My test case has 5 processes, each continuously creating a file, writing long random data, reading the data and deleting the file. When esdhc_writel_le is aligned on a certain address, the timeout occurs about 5 times per 12 hours with this test case. If I add one more NOP, the timeout does not occur at all. If I continue adding more NOPs, the timeouts come back. It does not seem to matter where in the code I add the NOPs, as long as the address is below the address of esdhc_writel_le.

We also run the same software on a dual-core i.MX6 without any timeout issues.

I have reproduced this with kernel versions 4.19.94 and 5.8.3, and we have compiled with both gcc8 and gcc9.
I'm still searching for the root cause and would appreciate any thoughts about where to go next.

Thanks,

-Claus-