Re: [PATCH 0/7] PM: Solution for S0ix failure caused by PCH overheating

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi, Rafael,

On Tue, 2022-05-17 at 17:11 +0200, Rafael J. Wysocki wrote:
> On Thu, May 5, 2022 at 3:58 AM Zhang Rui <rui.zhang@xxxxxxxxx> wrote:
> > 
> > On some Intel client platforms like SKL/KBL/CNL/CML, there is a
> > PCH thermal sensor that monitors the PCH temperature and blocks the
> > system
> > from entering S0ix in case it overheats.
> > 
> > Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to
> > PCH
> > temperature above threshold") introduces a delay loop to cool the
> > temperature down for this purpose.
> > 
> > However, in practice, we found that the time it takes to cool the
> > PCH down
> > below threshold highly depends on the initial PCH temperature when
> > the
> > delay starts, as well as the ambient temperature.
> > 
> > For example, on a Dell XPS 9360 laptop, the problem can be
> > triggered
> > 1. when it is suspended with heavy workload running.
> > or
> > 2. when it is moved from New Hampshire to Florida.
> > 
> > In these cases, the 1 second delay is not sufficient. As a result,
> > the
> > system stays in a shallower power state like PCx instead of S0ix,
> > and
> > drains the battery power, without user' notice.
> > 
> > In this patch series, we first fix the problem in patch 1/7 ~ 3/7,
> > by
> > 1. expand the default overall cooling delay timeout to 60 seconds.
> > 2. make sure the temperature is below threshold rather than equal
> > to it.
> > 3. move the delay to .suspend_noirq phase instead, in order to
> >    a) do the cooling when the system is in a more quiescent state
> >    b) be aware of wakeup events during the long delay, because some
> > wakeup
> >       events (ACPI Power button Press, USB mouse, etc) become valid
> > only
> >       in .suspend_noirq phase and later.
> > 
> > However, this potential long delay introduces a problem to our
> > suspend
> > stress automation test, because the delay makes it hard to predict
> > how
> > much time it takes to suspend the system.
> > As we want to do as much suspend iterations as possible in limited
> > time,
> > setting a 60+ seconds rtc alarm for suspend which usually takes
> > shorter
> > than 1 second is far beyond overkill.
> > 
> > Thus, in patch 4/7 ~ 7/7, a rtc driver hook is introduced, which
> > cancels
> > the armed rtc alarm in the beginning of suspend and then rearm the
> > rtc
> > alarm with a short interval (say, 2 second) right before system
> > suspended.
> > 
> > By running
> >  # echo 2 > /sys/module/rtc_cmos/parameters/rtc_wake_override_sec
> > before suspend, the system can be resumed by RTC alarm right after
> > it is
> > suspended, no matter how much time the suspend really takes.
> > 
> > This patch series has been tested on the same Dell XPS 9360 laptop
> > and
> > S0ix is 100% achieved across 1000+ s2idle iterations.
> 
> Overall, the first three patches in the series can go in without the
> rest, so let's put them into a separate series.
> 
> Patch [4/7] doesn't depend on the first three ones, so it can go in
> by itself.
> 
> Patch [5/7] is to be dropped anyway as per the earlier discussion.
> 
> Patch [6/7] is only needed to apply patch [7/7] which is
> controversial.
> 
> I think that we can drop or defer patches [6-7/7] for now.

This all sounds reasonable to me.
I will resend them separately.

-rui




[Index of Archives]     [Linux Sound]     [ALSA Users]     [ALSA Devel]     [Linux Audio Users]     [Linux Media]     [Kernel]     [Gimp]     [Yosemite News]     [Linux Media]

  Powered by Linux