RE: EXT: Re: mmc0: Timeout waiting for hardware cmd interrupt on i.MX535

"Baumgartner, Claus (GE Healthcare)" <claus.baumgartner@xxxxxxxxxx> · Mon, 7 Sep 2020 10:50:45 +0000

Hi,

> -----Original Message-----
> From: Sebastian Reichel <sebastian.reichel@xxxxxxxxxxxxx>
> Sent: 3. syyskuutata 2020 16:29
> To: Bough Chen <haibo.chen@xxxxxxx>
> Cc: dl-linux-imx <linux-imx@xxxxxxx>; linux-mmc@xxxxxxxxxxxxxxx; Shawn
> Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer <s.hauer@xxxxxxxxxxxxxx>;
> Pengutronix Kernel Team <kernel@xxxxxxxxxxxxxx>; Fabio Estevam
> <festevam@xxxxxxxxx>; Baumgartner, Claus (GE Healthcare)
> <claus.baumgartner@xxxxxxxxxx>
> Subject: EXT: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> i.MX535
> 
> Hi,
> 
> On Thu, Sep 03, 2020 at 02:10:43AM +0000, Bough Chen wrote:
> > > -----Original Message-----
> > > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx]
> > > Sent: 2020年9月2日 21:49
> > > To: Bough Chen <haibo.chen@xxxxxxx>
> > > Cc: dl-linux-imx <linux-imx@xxxxxxx>; linux-mmc@xxxxxxxxxxxxxxx;
> > > Shawn Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer
> > > <s.hauer@xxxxxxxxxxxxxx>; Pengutronix Kernel Team
> > > <kernel@xxxxxxxxxxxxxx>; Fabio Estevam <festevam@xxxxxxxxx>;
> > > Baumgartner, Claus (GE Healthcare) <claus.baumgartner@xxxxxxxxxx>
> > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > i.MX535
> > >
> > > On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx]
> > > > > Sent: 2020年9月1日 19:47
> > > > > To: dl-linux-imx <linux-imx@xxxxxxx>
> > > > > Cc: linux-mmc@xxxxxxxxxxxxxxx; Bough Chen <haibo.chen@xxxxxxx>;
> > > > > Shawn Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer
> > > > > <s.hauer@xxxxxxxxxxxxxx>; Pengutronix Kernel Team
> > > > > <kernel@xxxxxxxxxxxxxx>; Fabio Estevam <festevam@xxxxxxxxx>;
> > > > > Baumgartner, Claus (GE Healthcare)
> > > > > <claus.baumgartner@xxxxxxxxxx>
> > > > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on
> > > > > i.MX535
> > > > >
> > > > > Hi,
> > > > >
> > > > > [add i.MX architecture maintainers to Cc]
> > > > >
> > > > > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE
> > > > > Healthcare) wrote:
> > > > > > We have a board with an i.MX535 using a Samsung eMMC as
> > > > > > persistent storage connected to eSDHCv3. Every now and then we
> > > > > > produce a build that causes emmc timeouts:
> > > > > >
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for
> > > > > > hardware cmd
> > > > > interrupt.
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI
> > > > > > REGISTER DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0:
> > > sdhci: Sys addr:
> > > > > > 0xe3f12000 | Version:  0x00001201 Aug 28 07:32:12 csmon kernel:
> > > mmc0:
> > > > > > sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001 Aug 28
> > > > > > 07:32:12 csmon
> > > > > kernel: mmc0: sdhci: Argument:  0x00010000 | Trn mode:
> > > > > 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present:   0x01f80008 |
> > > Host
> > > > > ctl: 0x00000031
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power:     0x00000002 |
> > > Blk
> > > > > gap:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up:   0x00000000
> |
> > > > > Clock:    0x0000011f
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout:   0x0000008e |
> > > Int
> > > > > stat: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab:
> > > > > > 0x107f000b | Sig
> > > > > > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > > > > ACmd
> > > stat:
> > > > > 0x00000000 | Slot int: 0x00001201
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps:      0x07eb0000 |
> > > > > Caps_1:   0x08100810
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd:       0x00000d1a |
> > > > > Max curr: 0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]:   0x00400900 |
> > > > > Resp[1]:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]:   0x00000000 |
> > > > > Resp[3]:  0x00000000
> > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2:
> > > > > > 0x00000000 Aug
> > > > > > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err:  0x00000000 |
> > > > > > ADMA
> > > > > > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci:
> > > > > > ============================================
> > > > >
> > > > > Some extra information: The timeout always has cmd = 0x00000d1a
> > > > > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0]
> > > > > translating to this IIUIC:
> > > > >
> > > > > Bit 8 = Ready for data
> > > > > Bit 11 = CURRENT_STATE is TRAN
> > > > > Bit 22 = Illegal command
> > > >
> > > > According to the code logic, since this cmd13 get hardware cmd
> > > > timeout, which means this cmd13 do not get any response. Here the
> > > > Resp[0] should be the previous command's response.
> > > >
> > > > So this means the previous command is an illegal command, cause
> > > > the emmc internal firmware stuck, and can't response to the next cmd13.
> > > >
> > > > I think we need to firstly identify the specific place in emmc
> > > > driver which trigger the log dump.
> > >
> > > My understanding is, that a missing response from the eMMC should
> > > trigger the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK
> > > cycles (see section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference
> manual).
> > > 64 SDCLK cycles means, that this should recover quickly and would
> > > not be a problem for most usecases.
> > >
> > > But what we are seeing is the software 10 seconds timeout. My
> > > understanding is, that this should not be triggered if the SDHCI
> > > controller works as expected (e.g. by generating a IRQ for the
> > > timeout). This timeout is much more problematic, since all eMMC accessing
> processes block for those 10 seconds.
> > >
> >
> > Agree, only one possibility, the cmd13 do not send out successfully.
> 
> I think there are two possibilities:
> 
> 1. The command is not send out, so no IRQs are received.
> 2. The IRQ gets lost or is not generated
> 
> The esdhc_writel_le() has a workaround to avoid missing the card irq. If that
> does not fully fix the issue, I would expect the SW fallback to also cover that
> case.
> 
> > The count of 64 SDCLK cycle only trigged by the end of the sending
> > command. If the command still not send out completely, then should
> > trigger the 10s sw timeout. Let me double confirm with our IC team.
> 
> Ack.
> 
> > I still suggest that we need first to find which cmd13 in our mmc
> > driver meet this issue.
> 
> We will try to figure that out and report back. Needs a bit of time, since the
> error only appears after some hours on an affected kernel and adding the
> necessary code potentially hides the problem due to the alignment changes
> requiring another run with padding nops.
> 
> -- Sebastian

I changed the software so that it generates a back trace when the timeout occurs
and I had to pad with a few NOP instructions again to get the timeouts to occur.

I run the software on two units and got all together 22 timeouts during 2 days.  All
back traces were same as below:  

[ 2178.405338] mmc0: Timeout waiting for hardware cmd interrupt.
[ 2178.405354] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
[ 2178.405367] mmc0: sdhci: Sys addr:  0xe5f1a000 | Version:  0x00001201
[ 2178.405374] mmc0: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000001
[ 2178.405380] mmc0: sdhci: Argument:  0x00010000 | Trn mode: 0x00000000
[ 2178.405389] mmc0: sdhci: Present:   0x01f80008 | Host ctl: 0x00000031
[ 2178.405395] mmc0: sdhci: Power:     0x00000002 | Blk gap:  0x00000000
[ 2178.405401] mmc0: sdhci: Wake-up:   0x00000000 | Clock:    0x0000011f
[ 2178.405407] mmc0: sdhci: Timeout:   0x0000008e | Int stat: 0x00000000
[ 2178.405413] mmc0: sdhci: Int enab:  0x107f000b | Sig enab: 0x107f000b
[ 2178.405420] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201
[ 2178.405426] mmc0: sdhci: Caps:      0x07eb0000 | Caps_1:   0x08100810
[ 2178.405432] mmc0: sdhci: Cmd:       0x00000d1a | Max curr: 0x00000000
[ 2178.405438] mmc0: sdhci: Resp[0]:   0x00400900 | Resp[1]:  0x00000000
[ 2178.405444] mmc0: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[ 2178.405449] mmc0: sdhci: Host ctl2: 0x00000000
[ 2178.405456] mmc0: sdhci: ADMA Err:  0x00000000 | ADMA Ptr: 0xef041208
[ 2178.405460] mmc0: sdhci: ============================================
[ 2178.405681] CPU: 0 PID: 140 Comm: kworker/0:4H Not tainted 5.8.3 #1
[ 2178.405689] Hardware name: Freescale i.MX53 (Device Tree Support)
[ 2178.405715] Workqueue: mmc_complete mmc_blk_mq_complete_work
[ 2178.405723] Backtrace: 
[ 2178.405745] [<8010e0bc>] (dump_backtrace) from [<8010e480>] (show_stack+0x20/0x24)
[ 2178.405755]  r7:00000005 r6:200b0013 r5:00000000 r4:80f46ff4
[ 2178.405775] [<8010e460>] (show_stack) from [<8046c024>] (dump_stack+0xa4/0xb8)
[ 2178.405795] [<8046bf80>] (dump_stack) from [<80765890>] (mmc_wait_for_cmd+0x94/0xd0)
[ 2178.405804]  r7:00000005 r6:80f04f48 r5:00000000 r4:ed7abe14
[ 2178.405817] [<807657fc>] (mmc_wait_for_cmd) from [<8076c338>] (__mmc_send_status+0x78/0xa8)
[ 2178.405826]  r8:00000000 r7:00000005 r6:ed7abe68 r5:ee137800 r4:80f04f48
[ 2178.405844] [<8076c2c0>] (__mmc_send_status) from [<80775d7c>] (card_busy_detect+0x5c/0xf8)
[ 2178.405853]  r7:ee137800 r6:0002ddba r5:80f03d00 r4:0002d9d2
[ 2178.405864] [<80775d20>] (card_busy_detect) from [<80778d68>] (mmc_blk_mq_complete_prev_req.part.4+0x154/0x254)
[ 2178.405874]  r10:ee137800 r9:ee3034a8 r8:00000000 r7:ee21eec4 r6:80f04f48 r5:ee303400
[ 2178.405879]  r4:ee21ee08
[ 2178.405889] [<80778c14>] (mmc_blk_mq_complete_prev_req.part.4) from [<80779704>] (mmc_blk_mq_complete_work+0x30/0x34)
[ 2178.405899]  r10:80faaa50 r9:00000000 r8:00000000 r7:eefe4800 r6:eefdcb00 r5:ed5a4700
[ 2178.405905]  r4:ee21eed8
[ 2178.405917] [<807796d4>] (mmc_blk_mq_complete_work) from [<8014221c>] (process_one_work+0x238/0x590)
[ 2178.405926] [<80141fe4>] (process_one_work) from [<80142a20>] (worker_thread+0x60/0x5b8)
[ 2178.405936]  r10:eefdcb00 r9:80f03d00 r8:eefdcb18 r7:00000008 r6:eefdcb00 r5:ed5a4714
[ 2178.405941]  r4:ed5a4700
[ 2178.405956] [<801429c0>] (worker_thread) from [<8014957c>] (kthread+0x170/0x174)
[ 2178.405966]  r10:ed92de74 r9:ed5a4700 r8:801429c0 r7:ed7aa000 r6:00000000 r5:edbd8ac0
[ 2178.405971]  r4:eda67180
[ 2178.405982] [<8014940c>] (kthread) from [<80100128>] (ret_from_fork+0x14/0x2c)
[ 2178.405989] Exception stack(0xed7abfb0 to 0xed7abff8)
[ 2178.405997] bfa0:                                     00000000 00000000 00000000 00000000
[ 2178.406007] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 2178.406015] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[ 2178.406027]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:8014940c
[ 2178.406032]  r4:edbd8ac0

-Claus-

> 
> > > > > > Timeouts do not occur with every build. After some debugging I
> > > > > > have found that timeouts seem to depend on code alignment of
> > > > > > the esdhc_readl_le function. I have bisected the behavior by
> > > > > > using the System.map and moving/padding the code with NOP
> > > > > > instructions (mov r0,r0).
> > > > > >
> > > > > > My test case has 5 processes continuously creating a file,
> > > > > > writing random long data, reading data and deleting the file.
> > > > > > It seems that when the esdhc_writel_le is aligned on a certain
> > > > > > address then the timeout will occur about 5 times/12h using my
> > > > > > test case. If I add one more NOP, the timeout will not occur
> > > > > > at all. If I continue adding some more NOPs the timeouts come
> > > > > > back. Seems that it doesn't matter where in the code I add
> > > > > > NOPs as long as the address is below the address of esdhc_writel_le.
> > > > > >
> > > > > > We also run the same software on a dual core i.MX6 without any
> > > > > > timeout issues.
> > > > >
> > > > > And the same kernel binary is also used on an i.MX6 single core
> > > > > (albeit with different SW) withot triggering the problem so far.
> > > > >
> > > > > > I have reproduced this with kernel version 4.19.94 and 5.8.3
> > > > > > and we have compiled with both gcc8 and gcc9. I'm still
> > > > > > searching for the root cause and I would appreciate any
> > > > > > thoughts about where to go
> > > next.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > -Claus-
> > > > >
> > > > > To me it looks like it might involve an unknown hardware errata
> > > > > for i.MX53, but there has been one similar report before
> > > > > (unfortunately without the full register dump) involving virtualization:
> > > > >
> > > > > https://patchwork.kernel.org/patch/10705823/
> > > > >
> > > > > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y.
> > > > >
> > > > > -- Sebastian