Re: [PATCH] mmc: tmio: remove workaround for NON_REMOVABLE

Ulf Hansson <ulf.hansson@xxxxxxxxxx> · Thu, 21 Nov 2019 11:29:30 +0100

Geert, Wolfram

On Thu, 21 Nov 2019 at 10:35, Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> wrote:
>
> Hi Wolfram,
>
> On Thu, Nov 21, 2019 at 9:57 AM Wolfram Sang <wsa@xxxxxxxxxxxxx> wrote:
> > > So some of my local code on top must have impacted the behavior.
> >
> > Any change in temperature? Niklas and I wonder if it is thermal related.
>
> Nope. I tried an old "known" good kernel again yesterday, and it worked.
> That was BTW the one which had the additional debug prints:
>
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -199,6 +199,7 @@ static struct generic_pm_domain
> *dev_to_genpd(struct device *dev)
>  static int genpd_stop_dev(const struct generic_pm_domain *genpd,
>                           struct device *dev)
>  {
> +pr_info("==== %s/%s: stop\n", genpd->name, dev_name(dev));
>         WARN(device_may_wakeup(dev),
>              "Domain %s must be active_wakeup for wakeup source %s\n",
>              genpd->name, dev_name(dev));
> @@ -208,6 +209,7 @@ static int genpd_stop_dev(const struct
> generic_pm_domain *genpd,
>  static int genpd_start_dev(const struct generic_pm_domain *genpd,
>                            struct device *dev)
>  {
> +pr_info("==== %s/%s: start\n", genpd->name, dev_name(dev));
>         return GENPD_DEV_CALLBACK(genpd, int, start, dev);
>  }
>
> Removing those prints made the old kernel fail, too, so this is why I think
> it is a race with Runtime PM.
>
> With a tree based on latest renesas-drivers, it happens regardless of those
> debug prints.
>
> > > > I am working on an issue where the SCC hangs, but this has to do with
> > > > always providing the SCC clock (SDnH). I don't really see the connection
> > > > of that to RuntimePM yet, though :/
> > >
> > > Makes sense: this is consistent with the behavior when accessing
> > > registers without enabling the corresponding module clock: it hangs.
> > > So this can happen with other clocks, too.
> > > One more reason not to delegate clock handling to a guest, as doing it
> > > wrong can take down the host, too...
> >
> > You mean when it comes to virtualization?
>
> Exactly.
>
> > > > Can you test this simple workaround patch instead of the revert just so
> > > > we get an idea if these issues are related?
> > >
> > > Thanks, applying your workaround on top of
> > > renesas-drivers-2019-11-19-v5.4-rc8 fixes the issue.
> >
> > Ok, good to know thanks for testing. Currently, I wonder why reverting
> > the NON_REMOVABLE workaround makes a difference. Maybe it is not
> > temperature related but a some race with RPM? I am debugging in this
> > direction now. But the lockup is still hard to trigger for me. Tried
> > v5.4-rc8 + NON_REMOVABLE patch with no luck. Will try renesas-drivers
> > next.
>
> As I managed to bisect it, it was fairly reproducible for me. Just checkout
> commit 7a7dab237027939c ("mmc: tmio: remove workaround for NON_REMOVABLE"),
> or use renesas-drivers.
>
> Oh, if it's a race, it may be affected by the compiler, too.
> gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)
>
> > > This fix is part of renesas/topic/sdhi-manual-calib, right?
> >
> > Yes.
> >
> > > And thus has been present in some renesas-drivers release, but was
> > > dropped _before_ the 2019-10-15-v5.4-rc3 release.
> >
> > That would explain why it didn't show up before, right? And don't you
>
> Not exactly. That branch was dropped before Ulf reverted the
> NON_REMOVABLE workaround.
>
> > have a Ebisu in your board farm, too? Luckily, I have one, too, now. It
> > should be affected.
>
> Haven't seen the issue on Ebisu (yet?).
> To be sure, I have just retried again with the exact same kernel image
> and userland: m3n-salvator-xs hangs, ebisu boots fine (and I can read
> /dev/mmcblk2).
>
> But as it looks to be timing-related, and E3 has different/less CPU cores,
> it may still be affected.

A few comments around your runtime PM concerns, not sure if this
matters to you issues.

So, only by looking at the code for probing of the tmio variant
drivers, it seems like there are accesses (both reads and writes) for
the device being done, without first making sure that the clock
managed by the PM domain has been enabled. Is that really okay? Note,
this isn't a new thing, but it has been there as long as can remember.

For example, in renesas_sdhi_probe() there are calls made to
sd_ctrl_write16|read16() before calling tmio_mmc_host_probe().

Additionally in tmio_mmc_host_probe(), there are calls made to
sd_ctrl_write16|read16(), before calling pm_runtime_get_sync().

If you haven't noticed, we have also managed to replace the call with
pm_runtime_get_sync() with a call to dev_pm_domain_start(), to
simplify the code. The point is, these changes are queued via Rafael's
pm-tree for v5.5.

So, perhaps at this point we should simply drop the offending commit
and revisit this once v5.5-rc1 is out?

Kind regards
Uffe