Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535

Hi,

On Thu, Sep 03, 2020 at 02:10:43AM +0000, Bough Chen wrote:
> > -----Original Message-----
> > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx]
> > Sent: September 2, 2020 21:49
> > To: Bough Chen <haibo.chen@xxxxxxx>
> > Cc: dl-linux-imx <linux-imx@xxxxxxx>; linux-mmc@xxxxxxxxxxxxxxx; Shawn Guo
> > <shawnguo@xxxxxxxxxx>; Sascha Hauer <s.hauer@xxxxxxxxxxxxxx>;
> > Pengutronix Kernel Team <kernel@xxxxxxxxxxxxxx>; Fabio Estevam
> > <festevam@xxxxxxxxx>; Baumgartner, Claus (GE Healthcare)
> > <claus.baumgartner@xxxxxxxxxx>
> > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535
> > 
> > On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > > > -----Original Message-----
> > > > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx]
> > > > Sent: September 1, 2020 19:47
> > > > To: dl-linux-imx <linux-imx@xxxxxxx>
> > > > Cc: linux-mmc@xxxxxxxxxxxxxxx; Bough Chen <haibo.chen@xxxxxxx>;
> > > > Shawn Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer
> > > > <s.hauer@xxxxxxxxxxxxxx>; Pengutronix Kernel Team
> > > > <kernel@xxxxxxxxxxxxxx>; Fabio Estevam <festevam@xxxxxxxxx>;
> > > > Baumgartner, Claus (GE Healthcare) <claus.baumgartner@xxxxxxxxxx>
> > > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > > i.MX535
> > > >
> > > > Hi,
> > > >
> > > > [add i.MX architecture maintainers to Cc]
> > > >
> > > > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > > > Healthcare) wrote:
> > > > > We have a board with an i.MX535 using a Samsung eMMC as persistent
> > > > > storage connected to eSDHCv3. Every now and then we produce a
> > > > > build that causes emmc timeouts:
> > > > >
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for hardware cmd interrupt.
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Sys addr:  0xe3f12000 | Version:  0x00001201
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============================================
> > > >
> > > > Some extra information: The timeout always has cmd = 0x00000d1a
> > > > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] translating
> > > > to this IIUIC:
> > > >
> > > > Bit 8 = Ready for data
> > > > Bit 11 = CURRENT_STATE is TRAN
> > > > Bit 22 = Illegal command
> > >
> > > According to the code logic, this cmd13 hit the hardware cmd
> > > timeout, which means it did not get any response. Resp[0] here
> > > should therefore be the previous command's response.
> > >
> > > So this means the previous command was an illegal command, which
> > > caused the eMMC internal firmware to get stuck so that it cannot
> > > respond to the next cmd13.
> > >
> > > I think we first need to identify the specific place in the eMMC
> > > driver that triggers the log dump.
> > 
> > My understanding is that a missing response from the eMMC should trigger
> > the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK cycles (see
> > section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference manual).
> > 64 SDCLK cycles means that this should recover quickly and would not be a
> > problem for most use cases.
> > 
> > But what we are seeing is the 10-second software timeout. My understanding
> > is that this should not be triggered if the SDHCI controller works as expected
> > (e.g. by generating an IRQ for the timeout). This timeout is much more
> > problematic, since all processes accessing the eMMC block for those 10 seconds.
> > 
> 
> Agreed. There is only one possibility: the cmd13 was not sent out
> successfully.

I think there are two possibilities:

1. The command is not sent out, so no IRQs are received.
2. The IRQ gets lost or is not generated.

esdhc_writel_le() has a workaround to avoid missing the card
IRQ. If that does not fully fix the issue, I would expect the SW
timeout fallback to also cover that case.
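
For anyone following along, the register values from the dump quoted
above can be decoded with a few lines of script. This is only a sketch:
it assumes the standard SDHCI register layout (command index in bits
13:8 of the command register, command complete in bit 0 and command
timeout error in bit 16 of the interrupt status) and the eMMC R1 device
status layout. The function names are made up for this example and are
not part of any kernel API.

```python
# Sketch: decode selected fields of the SDHCI register dump quoted above.
# Bit positions are the standard SDHCI / eMMC R1 ones; helper names are
# hypothetical and exist only for this illustration.

R1_STATES = ["idle", "ready", "ident", "stby", "tran", "data",
             "rcv", "prg", "dis", "btst", "slp"]

SDHCI_INT_RESPONSE = 1 << 0    # command complete
SDHCI_INT_TIMEOUT = 1 << 16    # command timeout error


def decode_cmd(cmd_reg):
    """Split the SDHCI command register into (command index, flags)."""
    return (cmd_reg >> 8) & 0x3F, cmd_reg & 0xFF


def decode_r1(status):
    """Extract the R1 status bits discussed in this thread."""
    state = (status >> 9) & 0xF
    return {
        "ready_for_data": bool(status & (1 << 8)),
        "current_state": R1_STATES[state] if state < len(R1_STATES) else "reserved",
        "illegal_command": bool(status & (1 << 22)),
    }


def cmd_irq_state(int_stat, int_enab):
    """Report whether the command IRQs are enabled and whether one is latched."""
    mask = SDHCI_INT_RESPONSE | SDHCI_INT_TIMEOUT
    return {
        "enabled": (int_enab & mask) == mask,
        "pending": bool(int_stat & mask),
    }


if __name__ == "__main__":
    print(decode_cmd(0x00000D1A))
    print(decode_r1(0x00400900))
    print(cmd_irq_state(0x00000000, 0x107F000B))
```

With the values from the dump, the command index is 13, Resp[0] has
READY_FOR_DATA and ILLEGAL_COMMAND set with CURRENT_STATE = tran, and
both command interrupts are enabled while neither is latched in Int
stat. That last point is consistent with both possibilities above and
does not by itself tell them apart.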

> The 64 SDCLK cycle count is only triggered at the end of sending
> the command. If the command has not been sent out completely, the
> 10s SW timeout should trigger instead. Let me double-check with our
> IC team.

Ack.
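
To put the two timeouts in proportion, here is a back-of-the-envelope
calculation. The SDCLK frequencies are assumptions picked for
illustration (400 kHz as a typical identification clock, 52 MHz as the
high-speed eMMC maximum); the clock actually used on the affected board
is not stated in this thread.

```python
# Sketch: compare the 64-SDCLK hardware command timeout with the
# 10-second software timeout. Frequencies below are illustrative
# assumptions, not values taken from the affected board.

HW_TIMEOUT_CYCLES = 64
SW_TIMEOUT_S = 10.0


def sdclk_cycles_to_seconds(cycles, sdclk_hz):
    """Convert a number of SDCLK cycles to seconds at a given clock rate."""
    return cycles / sdclk_hz


if __name__ == "__main__":
    for hz in (400_000, 52_000_000):
        hw = sdclk_cycles_to_seconds(HW_TIMEOUT_CYCLES, hz)
        print(f"{hz} Hz: hardware timeout {hw * 1e6:.2f} us, "
              f"software timeout is {SW_TIMEOUT_S / hw:.0f}x longer")
```

Even at the slow clock the hardware timeout would be in the range of
microseconds, which is why the 10-second software path is the painful
one.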

> I still suggest that we first find out which cmd13 in our mmc
> driver hits this issue.

We will try to figure that out and report back. This will need a bit
of time: the error only appears after some hours on an affected
kernel, and adding the necessary debug code may hide the problem by
changing the code alignment, which would require another run with
padding NOPs.
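
Since every instrumented build has to be checked for alignment again,
a small helper that reads the symbol address from System.map may save
some time. The sketch below assumes the usual three-column System.map
format ("address type name"). The sample addresses are made up, and the
64-byte boundary is only an assumed value for illustration; which
boundary actually matters here is not known.

```python
# Sketch: look up a symbol in System.map text and report its offset
# relative to an alignment boundary. Sample data and the default
# boundary are hypothetical.

SAMPLE_SYSTEM_MAP = """\
c0512340 t esdhc_readl_le
c05123a4 t esdhc_writel_le
"""


def find_symbol(system_map_text, name):
    """Return the address of `name` as an int, or None if it is absent."""
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == name:
            return int(parts[0], 16)
    return None


def alignment_offset(addr, boundary=64):
    """Return how many bytes `addr` lies past the previous boundary."""
    return addr % boundary


if __name__ == "__main__":
    addr = find_symbol(SAMPLE_SYSTEM_MAP, "esdhc_writel_le")
    print(hex(addr), alignment_offset(addr))
```

Comparing this offset between a "good" and a "bad" build before
starting a multi-hour test run shows whether the added code moved the
function relative to the builds that were already tested.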

-- Sebastian

> > > > > Timeouts do not occur with every build. After some debugging I
> > > > > have found that timeouts seem to depend on code alignment of the
> > > > > esdhc_readl_le function. I have bisected the behavior by using the
> > > > > System.map and moving/padding the code with NOP instructions (mov
> > > > > r0,r0).
> > > > >
> > > > > My test case has 5 processes continuously creating a file, writing
> > > > > random long data, reading data and deleting the file. It seems
> > > > > that when the esdhc_writel_le is aligned on a certain address then
> > > > > the timeout will occur about 5 times/12h using my test case. If I
> > > > > add one more NOP, the timeout will not occur at all. If I continue
> > > > > adding some more NOPs the timeouts come back. Seems that it
> > > > > doesn't matter where in the code I add NOPs as long as the address
> > > > > is below the address of esdhc_writel_le.
> > > > >
> > > > > We also run the same software on a dual core i.MX6 without any
> > > > > timeout issues.
> > > >
> > > > And the same kernel binary is also used on an i.MX6 single core
> > > > (albeit with different SW) without triggering the problem so far.
> > > >
> > > > > I have reproduced this with kernel version 4.19.94 and 5.8.3 and
> > > > > we have compiled with both gcc8 and gcc9. I'm still searching for
> > > > > the root cause and I would appreciate any thoughts about where to go
> > next.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > -Claus-
> > > >
> > > > To me it looks like it might involve an unknown hardware erratum for
> > > > i.MX53, but there has been one similar report before (unfortunately
> > > > without the full register dump) involving virtualization:
> > > >
> > > > https://patchwork.kernel.org/patch/10705823/
> > > >
> > > > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > > >
> > > > -- Sebastian
