Re: [PATCH] mmc: tmio: remove workaround for NON_REMOVABLE

Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> · Thu, 21 Nov 2019 10:35:24 +0100

Hi Wolfram,

On Thu, Nov 21, 2019 at 9:57 AM Wolfram Sang <wsa@xxxxxxxxxxxxx> wrote:
> > So some of my local code on top must have impacted the behavior.
>
> Any change in temperature? Niklas and I wonder if it is thermal related.

Nope. I tried an old "known" good kernel again yesterday, and it worked.
That was BTW the one which had the additional debug prints:

--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -199,6 +199,7 @@ static struct generic_pm_domain *dev_to_genpd(struct device *dev)
 static int genpd_stop_dev(const struct generic_pm_domain *genpd,
                          struct device *dev)
 {
+pr_info("==== %s/%s: stop\n", genpd->name, dev_name(dev));
        WARN(device_may_wakeup(dev),
             "Domain %s must be active_wakeup for wakeup source %s\n",
             genpd->name, dev_name(dev));
@@ -208,6 +209,7 @@ static int genpd_stop_dev(const struct generic_pm_domain *genpd,
 static int genpd_start_dev(const struct generic_pm_domain *genpd,
                           struct device *dev)
 {
+pr_info("==== %s/%s: start\n", genpd->name, dev_name(dev));
        return GENPD_DEV_CALLBACK(genpd, int, start, dev);
 }

Removing those prints made the old kernel fail, too, so this is why I think
it is a race with Runtime PM.

With a tree based on latest renesas-drivers, it happens regardless of those
debug prints.
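
To illustrate why a debug print can hide a race like this, here is a toy
timeline model in userspace C. It is only a sketch under assumptions: the
struct, the function name, and all tick values are hypothetical, and it does
not claim to describe the actual failure in the SDHI or genpd code. It merely
shows that extra latency in one path can flip which side of a race wins.

```c
#include <stdbool.h>

/*
 * Toy timeline, all times in arbitrary ticks since the last runtime PM put.
 * Every name and number here is hypothetical.
 */
struct race_timeline {
	unsigned int stop_latency;  /* ticks until the stop callback gates the clock */
	unsigned int print_delay;   /* extra ticks spent in a debug print before stop */
	unsigned int access_time;   /* tick at which a late register access happens */
};

/* The access is safe only if it lands before the clock is actually gated. */
bool access_is_safe(const struct race_timeline *t)
{
	unsigned int clock_off_at = t->stop_latency + t->print_delay;

	return t->access_time < clock_off_at;
}
```

With stop_latency = 5 and access_time = 8, the access is safe when
print_delay = 10 and unsafe when print_delay = 0, which is the kind of
sensitivity to timing (and possibly to the compiler) one would expect from
a race.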

> > > I am working on an issue where the SCC hangs, but this has to do with
> > > always providing the SCC clock (SDnH). I don't really see the connection
> > > of that to RuntimePM yet, though :/
> >
> > Makes sense: this is consistent with the behavior when accessing
> > registers without enabling the corresponding module clock: it hangs.
> > So this can happen with other clocks, too.
> > One more reason not to delegate clock handling to a guest, as doing it
> > wrong can take down the host, too...
>
> You mean when it comes to virtualization?

Exactly.
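
For reference, the failure mode mentioned above (touching a module's
registers while its module clock is off) and the usual guard against it can
be mocked in userspace C. This is only a sketch: struct mock_module and the
mock_* helpers are hypothetical names, not kernel APIs, and the mock returns
an error where real hardware would hang. In a driver, the get/put pair would
typically be pm_runtime_get_sync() and pm_runtime_put().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mock of a clock-gated module; not a real kernel API. */
struct mock_module {
	bool clk_enabled;   /* module clock state */
	int usage_count;    /* runtime-PM-like reference count */
	uint32_t reg;       /* a single fake register */
};

/* Take a reference; the first user turns the clock on. */
void mock_get(struct mock_module *m)
{
	if (m->usage_count++ == 0)
		m->clk_enabled = true;
}

/* Drop a reference; the last user turns the clock off. */
void mock_put(struct mock_module *m)
{
	assert(m->usage_count > 0);
	if (--m->usage_count == 0)
		m->clk_enabled = false;
}

/*
 * Read the register. On real hardware an access with the clock off may
 * hang; the mock reports it as an error (-1) instead.
 */
int mock_read(const struct mock_module *m, uint32_t *val)
{
	if (!m->clk_enabled)
		return -1;
	*val = m->reg;
	return 0;
}
```

Any mock_read() outside a mock_get()/mock_put() pair fails, which is the
mock's stand-in for the hang.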

> > > Can you test this simple workaround patch instead of the revert just so
> > > we get an idea if these issues are related?
> >
> > Thanks, applying your workaround on top of
> > renesas-drivers-2019-11-19-v5.4-rc8 fixes the issue.
>
> Ok, good to know, thanks for testing. Currently, I wonder why reverting
> the NON_REMOVABLE workaround makes a difference. Maybe it is not
> temperature related but some race with RPM? I am debugging in this
> direction now. But the lockup is still hard to trigger for me. Tried
> v5.4-rc8 + NON_REMOVABLE patch with no luck. Will try renesas-drivers
> next.

As I managed to bisect it, it was fairly reproducible for me. Just check out
commit 7a7dab237027939c ("mmc: tmio: remove workaround for NON_REMOVABLE"),
or use renesas-drivers.

Oh, if it's a race, it may be affected by the compiler, too.
gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1)

> > This fix is part of renesas/topic/sdhi-manual-calib, right?
>
> Yes.
>
> > And thus has been present in some renesas-drivers release, but was
> > dropped _before_ the 2019-10-15-v5.4-rc3 release.
>
> That would explain why it didn't show up before, right? And don't you

Not exactly. That branch was dropped before Ulf reverted the
NON_REMOVABLE workaround.

> have an Ebisu in your board farm, too? Luckily, I have one, too, now. It
> should be affected.

Haven't seen the issue on Ebisu (yet?).
To be sure, I have just retried with the exact same kernel image
and userland: m3n-salvator-xs hangs, ebisu boots fine (and I can read
/dev/mmcblk2).

But as it looks to be timing-related, and E3 has different/fewer CPU cores,
it may still be affected.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds