Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535

Sebastian Reichel <sebastian.reichel@xxxxxxxxxxxxx> · Wed, 2 Sep 2020 15:49:06 +0200

Hi,

On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > -----Original Message-----
> > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx]
> > Sent: 1 September 2020 19:47
> > To: dl-linux-imx <linux-imx@xxxxxxx>
> > Cc: linux-mmc@xxxxxxxxxxxxxxx; Bough Chen <haibo.chen@xxxxxxx>; Shawn
> > Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer <s.hauer@xxxxxxxxxxxxxx>;
> > Pengutronix Kernel Team <kernel@xxxxxxxxxxxxxx>; Fabio Estevam
> > <festevam@xxxxxxxxx>; Baumgartner, Claus (GE Healthcare)
> > <claus.baumgartner@xxxxxxxxxx>
> > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
> > 
> > Hi,
> > 
> > [add i.MX architecture maintainers to Cc]
> > 
> > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > Healthcare) wrote:
> > > We have a board with an i.MX535 using a Samsung eMMC as persistent
> > > storage connected to eSDHCv3. Every now and then we produce a build
> > > that causes emmc timeouts:
> > >
> > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd interrupt.
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:  0xe3f12000 | Version:  0x00001201
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
> > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============================================
> > 
> > Some extra information: The timeout always has cmd = 0x00000d1a
> > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating to
> > this IIUIC:
> > 
> > Bit 8 = Ready for data
> > Bit 11 = CURRENT_STATE is TRAN
> > Bit 22 = Illegal command
> 
> According to the code logic, this cmd13 hit the hardware cmd
> timeout, which means it did not get any response. So the
> Resp[0] shown here should be the previous command's response.
>
> This means the previous command was an illegal command, which caused
> the eMMC's internal firmware to get stuck, so it can't respond to the
> next cmd13.
>
> I think we first need to identify the specific place in the
> eMMC driver that triggers the log dump.

My understanding is that a missing response from the eMMC should trigger
the Command Timeout Error Status IRQ in the eSDHC after 64 SDCLK cycles
(see section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference manual).
With only 64 SDCLK cycles, the controller should recover quickly, which
would not be a problem for most use cases.
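
As a rough sanity check of "quickly", the duration of 64 SDCLK cycles can be
worked out for a few clock rates. This is only a sketch: the helper name and
the example frequencies are assumptions for illustration, since the actual
SDCLK rate of the affected board is not stated in this thread.

```python
def sdclk_cycles_to_us(cycles: int, sdclk_hz: float) -> float:
    """Return the duration of `cycles` SD clock cycles in microseconds."""
    return cycles / sdclk_hz * 1e6


# Assumed example clock rates (not taken from the affected board).
for hz in (400_000, 25_000_000, 52_000_000):
    print(f"{hz / 1e6:g} MHz: {sdclk_cycles_to_us(64, hz):.2f} us")
```

Even at a slow 400 kHz clock, 64 cycles take only 160 us, which is several
orders of magnitude below a timeout measured in seconds.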

But what we are seeing is the 10-second software timeout. My understanding
is that this should not be triggered if the SDHCI controller works as expected
(e.g. by generating an IRQ for the timeout). This timeout is much more
problematic, since all processes accessing the eMMC block for those 10 seconds.
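
For reference, the Cmd and Resp[0] values from the register dump can be
decoded mechanically. This is a minimal sketch, assuming the standard SDHCI
command register layout and the R1 card status bit positions from the eMMC
specification; the function names are hypothetical and are not kernel APIs.

```python
# CURRENT_STATE values of the R1 card status (bits 12:9).
R1_STATES = ["idle", "ready", "ident", "stby", "tran", "data", "rcv", "prg", "dis"]


def decode_sdhci_cmd(reg: int) -> dict:
    """Split an SDHCI command register value into its main fields."""
    return {
        "index": (reg >> 8) & 0x3F,      # command index, bits 13:8
        "index_check": bool(reg & 0x10),  # command index check enable
        "crc_check": bool(reg & 0x08),    # command CRC check enable
        "resp_type": reg & 0x03,          # response type select
    }


def decode_r1(status: int) -> dict:
    """Decode the R1 card status bits discussed in this thread."""
    state = (status >> 9) & 0xF
    name = R1_STATES[state] if state < len(R1_STATES) else f"reserved({state})"
    return {
        "illegal_command": bool(status & (1 << 22)),
        "ready_for_data": bool(status & (1 << 8)),
        "current_state": name,
    }


print(decode_sdhci_cmd(0x0D1A))
print(decode_r1(0x00400900))
```

With the values from the dump this gives command index 13 (MMC_SEND_STATUS)
and a status of illegal command, ready for data, state tran.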

-- Sebastian

> Best Regards
> Haibo Chen
> 
> > 
> > > Timeouts do not occur with every build. After some debugging I have
> > > found that timeouts seem to depend on code alignment of the
> > > esdhc_readl_le function. I have bisected the behavior by using the
> > > System.map and moving/padding the code with NOP instructions (mov
> > > r0,r0).
> > >
> > > My test case has 5 processes continuously creating a file, writing
> > > random long data, reading data and deleting the file. It seems that
> > > when the esdhc_writel_le is aligned on a certain address then the
> > > timeout will occur about 5 times/12h using my test case. If I add one
> > > more NOP, the timeout will not occur at all. If I continue adding some
> > > more NOPs the timeouts come back. Seems that it doesn't matter where
> > > in the code I add NOPs as long as the address is below the address of
> > > esdhc_writel_le.
> > >
> > > We also run the same software on a dual core i.MX6 without any timeout
> > > issues.
> > 
> > And the same kernel binary is also used on an i.MX6 single core (albeit with
> > different SW) without triggering the problem so far.
> > 
> > > I have reproduced this with kernel version 4.19.94 and 5.8.3 and we
> > > have compiled with both gcc8 and gcc9. I'm still searching for the
> > > root cause and I would appreciate any thoughts about where to go next.
> > >
> > > Thanks,
> > >
> > > -Claus-
> > 
> > To me it looks like it might involve an unknown hardware erratum for i.MX53, but
> > there has been one similar report before (unfortunately without the full
> > register dump) involving virtualization:
> > 
> > https://patchwork.kernel.org/patch/10705823/
> > 
> > Note that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > 
> > -- Sebastian