RE: sdhci timeout on imx8mq

Bough Chen <haibo.chen@xxxxxxx> · Wed, 6 Jan 2021 09:29:45 +0000

> -----Original Message-----
> From: Lucas Stach [mailto:l.stach@xxxxxxxxxxxxxx]
> Sent: 2021年1月5日 23:07
> To: Bough Chen <haibo.chen@xxxxxxx>; Fabio Estevam
> <festevam@xxxxxxxxx>; Angus Ainslie <angus@xxxxxxxx>; Leonard Crestez
> <leonard.crestez@xxxxxxx>; Peng Fan <peng.fan@xxxxxxx>; Abel Vesa
> <abel.vesa@xxxxxxx>; Stephen Boyd <sboyd@xxxxxxxxxx>; Michael Turquette
> <mturquette@xxxxxxxxxxxx>
> Cc: Ulf Hansson <ulf.hansson@xxxxxxxxxx>; Guido Günther <agx@xxxxxxxxxxx>;
> linux-mmc <linux-mmc@xxxxxxxxxxxxxxx>; Adrian Hunter
> <adrian.hunter@xxxxxxxxx>; dl-linux-imx <linux-imx@xxxxxxx>; Sascha Hauer
> <kernel@xxxxxxxxxxxxxx>; moderated list:ARM/FREESCALE IMX / MXC ARM
> ARCHITECTURE <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>
> Subject: Re: sdhci timeout on imx8mq
> 
> Hi all,
> 
> Am Mittwoch, dem 08.07.2020 um 01:32 +0000 schrieb BOUGH CHEN:
> > > -----Original Message-----
> > > From: Fabio Estevam [mailto:festevam@xxxxxxxxx]
> > > Sent: 2020年7月7日 20:45
> > > To: Angus Ainslie <angus@xxxxxxxx>
> > > Cc: BOUGH CHEN <haibo.chen@xxxxxxx>; Ulf Hansson
> > > <ulf.hansson@xxxxxxxxxx>; Guido Günther <agx@xxxxxxxxxxx>; linux-
> > > mmc <linux-mmc@xxxxxxxxxxxxxxx>; Adrian Hunter
> > > <adrian.hunter@xxxxxxxxx>; dl-linux-imx <linux-imx@xxxxxxx>; Sascha
> > > Hauer < kernel@xxxxxxxxxxxxxx>; moderated list:ARM/FREESCALE IMX /
> > > MXC ARM ARCHITECTURE <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>
> > > Subject: Re: sdhci timeout on imx8mq
> > >
> > > Hi Angus,
> > >
> > > On Tue, Jun 30, 2020 at 4:39 PM Angus Ainslie <angus@xxxxxxxx>
> > > wrote:
> > >
> > > > Has there been any progress with this. I'm getting this on about
> > > > 50% of
> > >
> > > Not from my side, sorry.
> > >
> > > Bough,
> > >
> > > Do you know why this problem affects the imx8mq-evk versions that
> > > are populated with the Micron eMMC and not the ones with Sandisk
> > > eMMC?
> >
> > Hi Angus,
> >
> > Can you show me the full fail log? I do not meet this issue on my
> > side, besides, which kind of uboot do you use?
> 
> I was finally able to bisect this issue, which wasn't that much fun due to the
> issue not being reproducible 100%. :/ Turns out that the issue is even more
> interesting than I thought and likely doesn't have anything to do with SDHCI or
> used bootloader versions. Here's my current debugging state:
> 
> I've bisected the issue down to b04383b6a558 (clk: imx8mq: Define gates for
> pll1/2 fixed dividers). The change itself looks fine to me, still CC'ed Leonard for
> good measure.
> 
> In my testing the following partial revert fixes the issue:
> 
> --- a/drivers/clk/imx/clk-imx8mq.c
> +++ b/drivers/clk/imx/clk-imx8mq.c
> @@ -365,7 +365,7 @@ static int imx8mq_clocks_probe(struct
> platform_device *pdev)
>         hws[IMX8MQ_SYS1_PLL_133M_CG] =
> imx_clk_hw_gate("sys1_pll_133m_cg", "sys1_pll_out", base + 0x30, 15);
>         hws[IMX8MQ_SYS1_PLL_160M_CG] =
> imx_clk_hw_gate("sys1_pll_160m_cg", "sys1_pll_out", base + 0x30, 17);
>         hws[IMX8MQ_SYS1_PLL_200M_CG] =
> imx_clk_hw_gate("sys1_pll_200m_cg", "sys1_pll_out", base + 0x30, 19);
> -       hws[IMX8MQ_SYS1_PLL_266M_CG] =
> imx_clk_hw_gate("sys1_pll_266m_cg", "sys1_pll_out", base + 0x30, 21);
>         hws[IMX8MQ_SYS1_PLL_400M_CG] =
> imx_clk_hw_gate("sys1_pll_400m_cg", "sys1_pll_out", base + 0x30, 23);
>         hws[IMX8MQ_SYS1_PLL_800M_CG] =
> imx_clk_hw_gate("sys1_pll_800m_cg", "sys1_pll_out", base + 0x30, 25);
> 
> @@ -375,7 +375,7 @@ static int imx8mq_clocks_probe(struct
> platform_device *pdev)
>         hws[IMX8MQ_SYS1_PLL_133M] =
> imx_clk_hw_fixed_factor("sys1_pll_133m", "sys1_pll_133m_cg", 1, 6);
>         hws[IMX8MQ_SYS1_PLL_160M] =
> imx_clk_hw_fixed_factor("sys1_pll_160m", "sys1_pll_160m_cg", 1, 5);
>         hws[IMX8MQ_SYS1_PLL_200M] =
> imx_clk_hw_fixed_factor("sys1_pll_200m", "sys1_pll_200m_cg", 1, 4);
> -       hws[IMX8MQ_SYS1_PLL_266M] =
> imx_clk_hw_fixed_factor("sys1_pll_266m", "sys1_pll_266m_cg", 1, 3);
> +       hws[IMX8MQ_SYS1_PLL_266M] =
> + imx_clk_hw_fixed_factor("sys1_pll_266m", "sys1_pll_out", 1, 3);
>         hws[IMX8MQ_SYS1_PLL_400M] =
> imx_clk_hw_fixed_factor("sys1_pll_400m", "sys1_pll_400m_cg", 1, 2);
>         hws[IMX8MQ_SYS1_PLL_800M] =
> imx_clk_hw_fixed_factor("sys1_pll_800m", "sys1_pll_800m_cg", 1, 1);
> 
> The sys1_pll_266m is the parent of nand_usdhc_bus. I've validated that the
> SDHCI driver properly enables this bus clock across the problematic card access.
> So what I think is happening here is that both nand_usdhc_bus and
> sys1_pll_266m are initially enabled. Sometime during boot sys1_pll_266m gets
> disabled due to runtime PM on the enet_axi clock, which is a direct child of
> sys1_pll_266m. At this point nand_usdhc_bus is still enabled, but no consumer
> has claimed the clock yet, so the parent clock gets disabled while this branch of
> the clock tree is still active.

Hi Lucas,

According to the clock tree, if nand_usdhc_bus is still enabled, then sys1_pll_266m has no chance to disable.

    sys1_pll_266m_cg                  1        1        0   800000000          0     0  50000         Y
       sys1_pll_266m                  1        1        0   266666666          0     0  50000         Y
          nand_usdhc_bus              0        0        0   266666666          0     0  50000         N
             nand_usdhc_rawnand_clk       0        0        0   266666666          0     0  50000         N
          enet_axi                    1        1        0   266666666          0     0  50000         Y
             enet1_root_clk           2        2        0   266666666          0     0  50000         Y


This issue seems related with the following errta:

e11232: USDHC: uSDHC setting requirement for IPG_CLK and AHB_BUS clocks
Description: uSDHC AHB_BUS and IPG_CLK clocks must be synchronized.
Due to current physical design implementation, AHB_BUS and IPG_CLK must come from
same clock source to maintain clock sync.
Workaround: Set AHB_BUS and IPG_CLK to clock source from PLL1.

After sys1_pll_266m gate off/on, seems need to sync the USDHC AHB bus and USDHC IPG_clk again. (Here usdhc AHB BUS source from nand_usdhc_bus.)
This sync is handle by hardware, and maybe need some time, during this sync period, usdhc operation may has issue.

I just double check our local v5.10 branch, already revert the commit b04383b6a558 (clk: imx8mq: Define gates for pll1/2 fixed dividers).
So to fix this issue, one method is revert this patch, another method is keep the 'nand_usdhc_bus' always on. Add change like this:

diff --git a/drivers/clk/imx/clk-imx8mq.c b/drivers/clk/imx/clk-imx8mq.c
index 779ea69e639c..939806b36916 100644
--- a/drivers/clk/imx/clk-imx8mq.c
+++ b/drivers/clk/imx/clk-imx8mq.c
@@ -433,7 +433,7 @@ static int imx8mq_clocks_probe(struct platform_device *pdev)
        /* BUS */
        hws[IMX8MQ_CLK_MAIN_AXI] = imx8m_clk_hw_composite_bus_critical("main_axi", imx8mq_main_axi_sels, base + 0x8800);
        hws[IMX8MQ_CLK_ENET_AXI] = imx8m_clk_hw_composite_bus("enet_axi", imx8mq_enet_axi_sels, base + 0x8880);
-       hws[IMX8MQ_CLK_NAND_USDHC_BUS] = imx8m_clk_hw_composite_bus("nand_usdhc_bus", imx8mq_nand_usdhc_sels, base + 0x8900);
+       hws[IMX8MQ_CLK_NAND_USDHC_BUS] = imx8m_clk_hw_composite_bus_critical("nand_usdhc_bus", imx8mq_nand_usdhc_sels, base + 0x8900);
        hws[IMX8MQ_CLK_VPU_BUS] = imx8m_clk_hw_composite_bus("vpu_bus", imx8mq_vpu_bus_sels, base + 0x8980);
        hws[IMX8MQ_CLK_DISP_AXI] = imx8m_clk_hw_composite_bus("disp_axi", imx8mq_disp_axi_sels, base + 0x8a00);
        hws[IMX8MQ_CLK_DISP_APB] = imx8m_clk_hw_composite_bus("disp_apb", imx8mq_disp_apb_sels, base + 0x8a80);


What you think? Or any other suggestion?

> 
> The reference manual states about this situation: "For any clock, its source
> must be left on when it is kept on. Behavior is undefined if this rule is violated."
> And it seems this is exactly what's happening here: some kind of glitch is
> introduced in the nand_usdhc_bus clock, which prevents the SDHCI controller
> from working, even though the clock branch is properly enabled later on. On my
> system the SDHCI timeout and following runtime suspend/resume cycle on the
> nand_usdhc_bus clock seem to get it back into a working state.
> 
> So I think we need some solution at the clock driver/framework level to prevent
> shutting down parent clocks that have active branches, even if those branches
> aren't claimed by a consumer (yet).
> 
> Regards,
> Lucas