Hi Wolfram, On Thu, Nov 21, 2019 at 9:57 AM Wolfram Sang <wsa@xxxxxxxxxxxxx> wrote: > > So some of my local code on top must have impacted the behavior. > > Any change in temperature? Niklas and I wonder if it is thermal related. Nope. I tried an old "known" good kernel again yesterday, and it worked. That was BTW the one which had the additional debug prints: --- a/drivers/base/power/domain.c +++ b/drivers/base/power/domain.c @@ -199,6 +199,7 @@ static struct generic_pm_domain *dev_to_genpd(struct device *dev) static int genpd_stop_dev(const struct generic_pm_domain *genpd, struct device *dev) { +pr_info("==== %s/%s: stop\n", genpd->name, dev_name(dev)); WARN(device_may_wakeup(dev), "Domain %s must be active_wakeup for wakeup source %s\n", genpd->name, dev_name(dev)); @@ -208,6 +209,7 @@ static int genpd_stop_dev(const struct generic_pm_domain *genpd, static int genpd_start_dev(const struct generic_pm_domain *genpd, struct device *dev) { +pr_info("==== %s/%s: start\n", genpd->name, dev_name(dev)); return GENPD_DEV_CALLBACK(genpd, int, start, dev); } Removing those prints made the old kernel fail, too, so this is why I think it is a race with Runtime PM. With a tree based on latest renesas-drivers, it happens regardless of those debug prints. > > > I am working on an issue where the SCC hangs, but this has to do with > > > always providing the SCC clock (SDnH). I don't really see the connection > > > of that to RuntimePM yet, though :/ > > > > Makes sense: this is consistent with the behavior when accessing > > registers without enabling the corresponding module clock: it hangs. > > So this can happen with other clocks, too. > > One more reason not to delegate clock handling to a guest, as doing it > > wrong can take down the host, too... > > You mean when it comes to virtualization? Exactly. > > > Can you test this simple workaround patch instead of the revert just so > > > we get an idea if these issues are related? > > > > Thanks, applying your workaround on top of > > renesas-drivers-2019-11-19-v5.4-rc8 fixes the issue. > > Ok, good to know thanks for testing. Currently, I wonder why reverting > the NON_REMOVABLE workaround makes a difference. Maybe it is not > temperature related but a some race with RPM? I am debugging in this > direction now. But the lockup is still hard to trigger for me. Tried > v5.4-rc8 + NON_REMOVABLE patch with no luck. Will try renesas-drivers > next. As I managed to bisect it, it was fairly reproducible for me. Just checkout commit 7a7dab237027939c ("mmc: tmio: remove workaround for NON_REMOVABLE"), or use renesas-drivers. Oh, if it's a race, it may be affected by the compiler, too. gcc version 7.4.0 (Ubuntu/Linaro 7.4.0-1ubuntu1~18.04.1) > > This fix is part of renesas/topic/sdhi-manual-calib, right? > > Yes. > > > And thus has been present in some renesas-drivers release, but was > > dropped _before_ the 2019-10-15-v5.4-rc3 release. > > That would explain why it didn't show up before, right? And don't you Not exactly. That branch was dropped before Ulf reverted the NON_REMOVABLE workaround. > have a Ebisu in your board farm, too? Luckily, I have one, too, now. It > should be affected. Haven't seen the issue on Ebisu (yet?). To be sure, I have just retried again with the exact same kernel image and userland: m3n-salvator-xs hangs, ebisu boots fine (and I can read /dev/mmcblk2). But as it looks to be timing-related, and E3 has different/less CPU cores, it may still be affected. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds