Re: [PATCH 2/7] thermal: intel: pch: enhance overheat handling

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, May 5, 2022 at 3:58 AM Zhang Rui <rui.zhang@xxxxxxxxx> wrote:
>
> Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH
> temperature above threshold") introduces delay loop mechanism that allows
> PCH temperature to go down below threshold during suspend so it won't
> block S0ix. And the default overall delay timeout is 1 second.
>
> However, in practice, we found that the time it takes to cool the PCH down
> below threshold highly depends on the initial PCH temperature when the
> delay starts, as well as the ambient temperature.
> And in some cases, the 1 second delay is not sufficient. As a result, the
> system stays in a shallower power state like PCx instead of S0ix, and
> drains the battery power, without user' notice.
>
> To make sure S0ix is not blocked by the PCH overheating, we
> 1. expand the default overall timeout to 60 seconds.
> 2. make sure the temperature is below threshold rather than equal to it.
> 3. move the delay to .suspend_noirq phase instead, in order to
>    a) do cooling delay with a more quiescent system
>    b) be aware of wakeup events during the long delay, because some wakeup
>       events (ACPI Power button Press, USB mouse, etc) become valid only
>       in .suspend_noirq phase and later.
>
> This may introduce longer suspend time, but only in the cases when the
> system overheats and Linux used to enter a shallower S2idle state, say,
> PCx instead of S0ix.
>
> Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
> Tested-by: Sumeet Pawnikar <sumeet.r.pawnikar@xxxxxxxxx>
> ---
>  drivers/thermal/intel/intel_pch_thermal.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c
> index 527c91f5960b..b7b32e2f5ae2 100644
> --- a/drivers/thermal/intel/intel_pch_thermal.c
> +++ b/drivers/thermal/intel/intel_pch_thermal.c
> @@ -70,8 +70,8 @@ static unsigned int delay_timeout = 100;
>  module_param(delay_timeout, int, 0644);
>  MODULE_PARM_DESC(delay_timeout, "amount of time delay for each iteration.");
>
> -/* Number of iterations for cooling delay, 10 counts by default for now */
> -static unsigned int delay_cnt = 10;
> +/* Number of iterations for cooling delay, 600 counts by default for now */
> +static unsigned int delay_cnt = 600;
>  module_param(delay_cnt, int, 0644);
>  MODULE_PARM_DESC(delay_cnt, "total number of iterations for time delay.");
>
> @@ -193,10 +193,11 @@ static int pch_wpt_get_temp(struct pch_thermal_device *ptd, int *temp)
>         return 0;
>  }
>
> +/* Cool the PCH when it's overheat in .suspend_noirq phase */
>  static int pch_wpt_suspend(struct pch_thermal_device *ptd)
>  {
>         u8 tsel;
> -       u8 pch_delay_cnt = 1;
> +       int pch_delay_cnt = 1;
>         u16 pch_thr_temp, pch_cur_temp;
>
>         /* Shutdown the thermal sensor if it is not enabled by BIOS */
> @@ -233,7 +234,10 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd)
>          * which helps to indentify the reason why S0ix entry was rejected.
>          */
>         while (pch_delay_cnt <= delay_cnt) {
> -               if (pch_cur_temp <= pch_thr_temp)
> +               if (pch_cur_temp < pch_thr_temp)
> +                       break;
> +
> +               if (pm_wakeup_pending())
>                         break;
>
>                 dev_warn(&ptd->pdev->dev,
> @@ -245,7 +249,7 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd)
>                 pch_delay_cnt++;
>         }
>
> -       if (pch_cur_temp > pch_thr_temp)
> +       if (pch_cur_temp >= pch_thr_temp)
>                 dev_warn(&ptd->pdev->dev,
>                         "CPU-PCH is hot [%dC] even after delay, continue to suspend. S0ix might fail\n",
>                         pch_cur_temp);
> @@ -455,7 +459,7 @@ static void intel_pch_thermal_remove(struct pci_dev *pdev)
>         pci_disable_device(pdev);
>  }
>
> -static int intel_pch_thermal_suspend(struct device *device)
> +static int intel_pch_thermal_suspend_noirq(struct device *device)
>  {
>         struct pch_thermal_device *ptd = dev_get_drvdata(device);
>
> @@ -495,7 +499,7 @@ static const struct pci_device_id intel_pch_thermal_id[] = {
>  MODULE_DEVICE_TABLE(pci, intel_pch_thermal_id);
>
>  static const struct dev_pm_ops intel_pch_pm_ops = {
> -       .suspend = intel_pch_thermal_suspend,
> +       .suspend_noirq = intel_pch_thermal_suspend_noirq,

IMO it would be better to put this change into a separate patch and
reorder the other changes after this one.  It is valid by itself.

>         .resume = intel_pch_thermal_resume,
>  };
>
> --



[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux