Hi, > -----Original Message----- > From: Sebastian Reichel <sebastian.reichel@xxxxxxxxxxxxx> > Sent: 3. syyskuutata 2020 16:29 > To: Bough Chen <haibo.chen@xxxxxxx> > Cc: dl-linux-imx <linux-imx@xxxxxxx>; linux-mmc@xxxxxxxxxxxxxxx; Shawn > Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer <s.hauer@xxxxxxxxxxxxxx>; > Pengutronix Kernel Team <kernel@xxxxxxxxxxxxxx>; Fabio Estevam > <festevam@xxxxxxxxx>; Baumgartner, Claus (GE Healthcare) > <claus.baumgartner@xxxxxxxxxx> > Subject: EXT: Re: mmc0: Timeout waiting for hardware cmd interrupt on > i.MX535 > > Hi, > > On Thu, Sep 03, 2020 at 02:10:43AM +0000, Bough Chen wrote: > > > -----Original Message----- > > > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx] > > > Sent: 2020年9月2日 21:49 > > > To: Bough Chen <haibo.chen@xxxxxxx> > > > Cc: dl-linux-imx <linux-imx@xxxxxxx>; linux-mmc@xxxxxxxxxxxxxxx; > > > Shawn Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer > > > <s.hauer@xxxxxxxxxxxxxx>; Pengutronix Kernel Team > > > <kernel@xxxxxxxxxxxxxx>; Fabio Estevam <festevam@xxxxxxxxx>; > > > Baumgartner, Claus (GE Healthcare) <claus.baumgartner@xxxxxxxxxx> > > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on > > > i.MX535 > > > > > > On Wed, Sep 02, 2020 at 11:24:52AM +0000, Bough Chen wrote: > > > > > -----Original Message----- > > > > > From: Sebastian Reichel [mailto:sebastian.reichel@xxxxxxxxxxxxx] > > > > > Sent: 2020年9月1日 19:47 > > > > > To: dl-linux-imx <linux-imx@xxxxxxx> > > > > > Cc: linux-mmc@xxxxxxxxxxxxxxx; Bough Chen <haibo.chen@xxxxxxx>; > > > > > Shawn Guo <shawnguo@xxxxxxxxxx>; Sascha Hauer > > > > > <s.hauer@xxxxxxxxxxxxxx>; Pengutronix Kernel Team > > > > > <kernel@xxxxxxxxxxxxxx>; Fabio Estevam <festevam@xxxxxxxxx>; > > > > > Baumgartner, Claus (GE Healthcare) > > > > > <claus.baumgartner@xxxxxxxxxx> > > > > > Subject: Re: mmc0: Timeout waiting for hardware cmd interrupt on > > > > > i.MX535 > > > > > > > > > > Hi, > > > > > > > > > > [add i.MX architecture maintainers to Cc] > > > > > > > > > > On Tue, Sep 01, 2020 at 07:37:31AM +0000, Baumgartner, Claus (GE > > > > > Healthcare) wrote: > > > > > > We have a board with an i.MX535 using a Samsung eMMC as > > > > > > persistent storage connected to eSDHCv3. Every now and then we > > > > > > produce a build that causes emmc timeouts: > > > > > > > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: Timeout waiting for > > > > > > hardware cmd > > > > > interrupt. > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: ============ SDHCI > > > > > > REGISTER DUMP =========== Aug 28 07:32:12 csmon kernel: mmc0: > > > sdhci: Sys addr: > > > > > > 0xe3f12000 | Version: 0x00001201 Aug 28 07:32:12 csmon kernel: > > > mmc0: > > > > > > sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000001 Aug 28 > > > > > > 07:32:12 csmon > > > > > kernel: mmc0: sdhci: Argument: 0x00010000 | Trn mode: > > > > > 0x00000000 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Present: 0x01f80008 | > > > Host > > > > > ctl: 0x00000031 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Power: 0x00000002 | > > > Blk > > > > > gap: 0x00000000 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Wake-up: 0x00000000 > | > > > > > Clock: 0x0000011f > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Timeout: 0x0000008e | > > > Int > > > > > stat: 0x00000000 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Int enab: > > > > > > 0x107f000b | Sig > > > > > > enab: 0x107f000b Aug 28 07:32:12 csmon kernel: mmc0: sdhci: > > > > > > ACmd > > > stat: > > > > > 0x00000000 | Slot int: 0x00001201 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Caps: 0x07eb0000 | > > > > > Caps_1: 0x08100810 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Cmd: 0x00000d1a | > > > > > Max curr: 0x00000000 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[0]: 0x00400900 | > > > > > Resp[1]: 0x00000000 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Resp[2]: 0x00000000 | > > > > > Resp[3]: 0x00000000 > > > > > > Aug 28 07:32:12 csmon kernel: mmc0: sdhci: Host ctl2: > > > > > > 0x00000000 Aug > > > > > > 28 07:32:12 csmon kernel: mmc0: sdhci: ADMA Err: 0x00000000 | > > > > > > ADMA > > > > > > Ptr: 0xef041208 Aug 28 07:32:12 csmon kernel: mmc0: sdhci: > > > > > > ============================================ > > > > > > > > > > Some extra information: The timeout always has cmd = 0x00000d1a > > > > > (MMC_SEND_STATUS) and resp[0] = 0x00400900 with resp[0] > > > > > translating to this IIUIC: > > > > > > > > > > Bit 8 = Ready for data > > > > > Bit 11 = CURRENT_STATE is TRAN > > > > > Bit 22 = Illegal command > > > > > > > > According to the code logic, since this cmd13 get hardware cmd > > > > timeout, which means this cmd13 do not get any response. Here the > > > > Resp[0] should be the previous command's response. > > > > > > > > So this means the previous command is an illegal command, cause > > > > the emmc internal firmware stuck, and can't response to the next cmd13. > > > > > > > > I think we need to firstly identify the specific place in emmc > > > > driver which trigger the log dump. > > > > > > My understanding is, that a missing response from the eMMC should > > > trigger the Command Timeout Error Status IRQ in eSDHC after 64 SDCLK > > > cycles (see section 30.7.10 [ESDHCV3x_IRQSTAT] in the i.MX53 reference > manual). > > > 64 SDCLK cycles means, that this should recover quickly and would > > > not be a problem for most usecases. > > > > > > But what we are seeing is the software 10 seconds timeout. My > > > understanding is, that this should not be triggered if the SDHCI > > > controller works as expected (e.g. by generating a IRQ for the > > > timeout). This timeout is much more problematic, since all eMMC accessing > processes block for those 10 seconds. > > > > > > > Agree, only one possibility, the cmd13 do not send out successfully. > > I think there are two possibilities: > > 1. The command is not send out, so no IRQs are received. > 2. The IRQ gets lost or is not generated > > The esdhc_writel_le() has a workaround to avoid missing the card irq. If that > does not fully fix the issue, I would expect the SW fallback to also cover that > case. > > > The count of 64 SDCLK cycle only trigged by the end of the sending > > command. If the command still not send out completely, then should > > trigger the 10s sw timeout. Let me double confirm with our IC team. > > Ack. > > > I still suggest that we need first to find which cmd13 in our mmc > > driver meet this issue. > > We will try to figure that out and report back. Needs a bit of time, since the > error only appears after some hours on an affected kernel and adding the > necessary code potentially hides the problem due to the alignment changes > requiring another run with padding nops. > > -- Sebastian I changed the software so that it generates a back trace when the timeout occurs and I had to pad with a few NOP instructions again to get the timeouts to occur. I run the software on two units and got all together 22 timeouts during 2 days. All back traces were same as below: [ 2178.405338] mmc0: Timeout waiting for hardware cmd interrupt. [ 2178.405354] mmc0: sdhci: ============ SDHCI REGISTER DUMP =========== [ 2178.405367] mmc0: sdhci: Sys addr: 0xe5f1a000 | Version: 0x00001201 [ 2178.405374] mmc0: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000001 [ 2178.405380] mmc0: sdhci: Argument: 0x00010000 | Trn mode: 0x00000000 [ 2178.405389] mmc0: sdhci: Present: 0x01f80008 | Host ctl: 0x00000031 [ 2178.405395] mmc0: sdhci: Power: 0x00000002 | Blk gap: 0x00000000 [ 2178.405401] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x0000011f [ 2178.405407] mmc0: sdhci: Timeout: 0x0000008e | Int stat: 0x00000000 [ 2178.405413] mmc0: sdhci: Int enab: 0x107f000b | Sig enab: 0x107f000b [ 2178.405420] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00001201 [ 2178.405426] mmc0: sdhci: Caps: 0x07eb0000 | Caps_1: 0x08100810 [ 2178.405432] mmc0: sdhci: Cmd: 0x00000d1a | Max curr: 0x00000000 [ 2178.405438] mmc0: sdhci: Resp[0]: 0x00400900 | Resp[1]: 0x00000000 [ 2178.405444] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000 [ 2178.405449] mmc0: sdhci: Host ctl2: 0x00000000 [ 2178.405456] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0xef041208 [ 2178.405460] mmc0: sdhci: ============================================ [ 2178.405681] CPU: 0 PID: 140 Comm: kworker/0:4H Not tainted 5.8.3 #1 [ 2178.405689] Hardware name: Freescale i.MX53 (Device Tree Support) [ 2178.405715] Workqueue: mmc_complete mmc_blk_mq_complete_work [ 2178.405723] Backtrace: [ 2178.405745] [<8010e0bc>] (dump_backtrace) from [<8010e480>] (show_stack+0x20/0x24) [ 2178.405755] r7:00000005 r6:200b0013 r5:00000000 r4:80f46ff4 [ 2178.405775] [<8010e460>] (show_stack) from [<8046c024>] (dump_stack+0xa4/0xb8) [ 2178.405795] [<8046bf80>] (dump_stack) from [<80765890>] (mmc_wait_for_cmd+0x94/0xd0) [ 2178.405804] r7:00000005 r6:80f04f48 r5:00000000 r4:ed7abe14 [ 2178.405817] [<807657fc>] (mmc_wait_for_cmd) from [<8076c338>] (__mmc_send_status+0x78/0xa8) [ 2178.405826] r8:00000000 r7:00000005 r6:ed7abe68 r5:ee137800 r4:80f04f48 [ 2178.405844] [<8076c2c0>] (__mmc_send_status) from [<80775d7c>] (card_busy_detect+0x5c/0xf8) [ 2178.405853] r7:ee137800 r6:0002ddba r5:80f03d00 r4:0002d9d2 [ 2178.405864] [<80775d20>] (card_busy_detect) from [<80778d68>] (mmc_blk_mq_complete_prev_req.part.4+0x154/0x254) [ 2178.405874] r10:ee137800 r9:ee3034a8 r8:00000000 r7:ee21eec4 r6:80f04f48 r5:ee303400 [ 2178.405879] r4:ee21ee08 [ 2178.405889] [<80778c14>] (mmc_blk_mq_complete_prev_req.part.4) from [<80779704>] (mmc_blk_mq_complete_work+0x30/0x34) [ 2178.405899] r10:80faaa50 r9:00000000 r8:00000000 r7:eefe4800 r6:eefdcb00 r5:ed5a4700 [ 2178.405905] r4:ee21eed8 [ 2178.405917] [<807796d4>] (mmc_blk_mq_complete_work) from [<8014221c>] (process_one_work+0x238/0x590) [ 2178.405926] [<80141fe4>] (process_one_work) from [<80142a20>] (worker_thread+0x60/0x5b8) [ 2178.405936] r10:eefdcb00 r9:80f03d00 r8:eefdcb18 r7:00000008 r6:eefdcb00 r5:ed5a4714 [ 2178.405941] r4:ed5a4700 [ 2178.405956] [<801429c0>] (worker_thread) from [<8014957c>] (kthread+0x170/0x174) [ 2178.405966] r10:ed92de74 r9:ed5a4700 r8:801429c0 r7:ed7aa000 r6:00000000 r5:edbd8ac0 [ 2178.405971] r4:eda67180 [ 2178.405982] [<8014940c>] (kthread) from [<80100128>] (ret_from_fork+0x14/0x2c) [ 2178.405989] Exception stack(0xed7abfb0 to 0xed7abff8) [ 2178.405997] bfa0: 00000000 00000000 00000000 00000000 [ 2178.406007] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 2178.406015] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 [ 2178.406027] r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:8014940c [ 2178.406032] r4:edbd8ac0 -Claus- > > > > > > > Timeouts do not occur with every build. After some debugging I > > > > > > have found that timeouts seem to depend on code alignment of > > > > > > the esdhc_readl_le function. I have bisected the behavior by > > > > > > using the System.map and moving/padding the code with NOP > > > > > > instructions (mov r0,r0). > > > > > > > > > > > > My test case has 5 processes continuously creating a file, > > > > > > writing random long data, reading data and deleting the file. > > > > > > It seems that when the esdhc_writel_le is aligned on a certain > > > > > > address then the timeout will occur about 5 times/12h using my > > > > > > test case. If I add one more NOP, the timeout will not occur > > > > > > at all. If I continue adding some more NOPs the timeouts come > > > > > > back. Seems that it doesn't matter where in the code I add > > > > > > NOPs as long as the address is below the address of esdhc_writel_le. > > > > > > > > > > > > We also run the same software on a dual core i.MX6 without any > > > > > > timeout issues. > > > > > > > > > > And the same kernel binary is also used on an i.MX6 single core > > > > > (albeit with different SW) withot triggering the problem so far. > > > > > > > > > > > I have reproduced this with kernel version 4.19.94 and 5.8.3 > > > > > > and we have compiled with both gcc8 and gcc9. I'm still > > > > > > searching for the root cause and I would appreciate any > > > > > > thoughts about where to go > > > next. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > -Claus- > > > > > > > > > > To me it looks like it might involve an unknown hardware errata > > > > > for i.MX53, but there has been one similar report before > > > > > (unfortunately without the full register dump) involving virtualization: > > > > > > > > > > https://patchwork.kernel.org/patch/10705823/ > > > > > > > > > > Note, that Claus' kernel has been built with CONFIG_PREEMPT_NONE=y. > > > > > > > > > > -- Sebastian