RE: Regression in 5.16-rc1 with suspend to idle

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]

> it:
> > >
> > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> > > index 04ea92cbd9cf..f5bf46782043 100644
> > > --- a/drivers/base/power/main.c
> > > +++ b/drivers/base/power/main.c
> > > @@ -32,6 +32,7 @@
> > >  #include <linux/suspend.h>
> > >  #include <trace/events/power.h>
> > >  #include <linux/cpufreq.h>
> > > +#include <linux/cpuidle.h>
> > >  #include <linux/devfreq.h>
> > >  #include <linux/timer.h>
> > >
> > > @@ -1350,6 +1351,8 @@ int dpm_suspend_noirq(pm_message_t state)
> > >  {
> > >         int ret;
> > >
> > > +       cpuidle_pause();
> >
> > Can you replace this with wake_up_all_idle_cpus() and remove the
> > cpuidle_resume()/cpuidle_pause() calls from s2idle_enter() and see
> > what happens?
> >
> > > +
> > >         device_wakeup_arm_wake_irqs();
> > >         suspend_device_irqs();
> > >
> > > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> > > index 6fcdee7e87a5..1708a643ba5d 100644
> > > --- a/kernel/power/suspend.c
> > > +++ b/kernel/power/suspend.c
> > > @@ -97,6 +97,7 @@ static void s2idle_enter(void)
> > >         raw_spin_unlock_irq(&s2idle_lock);
> > >
> > >         cpus_read_lock();
> > > +       cpuidle_resume();
> > >
> > >         /* Push all the CPUs into the idle loop. */
> > >         wake_up_all_idle_cpus();
> > > @@ -104,6 +105,7 @@ static void s2idle_enter(void)
> > >         swait_event_exclusive(s2idle_wait_head,
> > >                     s2idle_state == S2IDLE_STATE_WAKE);
> > >
> > > +       cpuidle_pause();
> > >         cpus_read_unlock();
> > >
> > >         raw_spin_lock_irq(&s2idle_lock);
> > >
> > >
> > > * Removing the cpuidle_pause call from s2idle_enter will fix the crash,
> > >   but the system doesn't enter the deepest sleep state.
> >
> > I think you mean that removing it doesn't make a difference except
> > that it prevents the deepest state from being entered.
> >
> > Pausing cpuidle here is kind of redundant, because it just flips the
> > "initialized" flag that's going to be flipped again by
> > cpuidle_resume() in the next iteration.  [BTW, your patch is missing
> > one an additional cpuidle_resume() in the resume path to match this
> > cpuidle_pause().]  Also calling  wake_up_all_idle_cpus() is not likely
> > to make a difference, so I'm guessing that synchronize_rcu() does the
> > trick.
> >
> > > * removing the cpuidle_pause call from dpm_suspend_noirq the
> firmware continues to
> > >    crash.
> >
> > So the crash occurs while running dpm_suspend_noirq().
> >
> > > I also confirmed that reverting both of these commits together on top of
> 5.17-rc4 fixes it:
> > >
> > > 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-
> idle path")
> > > 23f62d7ab25b ("PM: sleep: Pause cpuidle later and resume it earlier
> during system transitions")
> > >
> > > The commit messages at least make it sound like it was just a rework for
> unification of the codepaths,
> > > not supposed to be for anything to be actually fixed, so I would think it
> should be safe to revert.
> >
> > There is a deeper issue related to these commits and I'm not inclined
> > to go back to the old code if there is only one system affected by
> > this and the problem is related to the platform firmware on that
> > system.
> >
> > > So please advise which way you want to go and I'll send up a patch (or if
> you want to write one
> > > I'm happy to take it and test it since I can readily reproduce it).
> >
> > To start with, please test the attached debug patch on top of -rc4.
> 
> Attached is another patch to try, testing the hypothesis that the
> observed crash is related to CPUs being in idle state that are too
> deep for some reason during late suspend and early resume.

I tried 3 test kernels:
* 5.17-rc4 + Your second debugging patch
* 5.17-rc4+ Your first debugging patch
* 5.17-rc4 + A hack I wrote that pushed amd-pmc into "later" in the suspend
using a global symbol called after LPS0 instead of letting it run in noirq stage
 
It works properly on all of those, tried about 5x time in each.

Then I confirmed I could still crash it on 5.17-rc4 with my control kernel.




[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux