On Thu, May 5, 2022 at 3:58 AM Zhang Rui <rui.zhang@xxxxxxxxx> wrote: > > Commit ef63b043ac86 ("thermal: intel: pch: fix S0ix failure due to PCH > temperature above threshold") introduces delay loop mechanism that allows > PCH temperature to go down below threshold during suspend so it won't > block S0ix. And the default overall delay timeout is 1 second. > > However, in practice, we found that the time it takes to cool the PCH down > below threshold highly depends on the initial PCH temperature when the > delay starts, as well as the ambient temperature. > And in some cases, the 1 second delay is not sufficient. As a result, the > system stays in a shallower power state like PCx instead of S0ix, and > drains the battery power, without user' notice. > > To make sure S0ix is not blocked by the PCH overheating, we > 1. expand the default overall timeout to 60 seconds. > 2. make sure the temperature is below threshold rather than equal to it. > 3. move the delay to .suspend_noirq phase instead, in order to > a) do cooling delay with a more quiescent system > b) be aware of wakeup events during the long delay, because some wakeup > events (ACPI Power button Press, USB mouse, etc) become valid only > in .suspend_noirq phase and later. > > This may introduce longer suspend time, but only in the cases when the > system overheats and Linux used to enter a shallower S2idle state, say, > PCx instead of S0ix. > > Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx> > Tested-by: Sumeet Pawnikar <sumeet.r.pawnikar@xxxxxxxxx> > --- > drivers/thermal/intel/intel_pch_thermal.c | 18 +++++++++++------- > 1 file changed, 11 insertions(+), 7 deletions(-) > > diff --git a/drivers/thermal/intel/intel_pch_thermal.c b/drivers/thermal/intel/intel_pch_thermal.c > index 527c91f5960b..b7b32e2f5ae2 100644 > --- a/drivers/thermal/intel/intel_pch_thermal.c > +++ b/drivers/thermal/intel/intel_pch_thermal.c > @@ -70,8 +70,8 @@ static unsigned int delay_timeout = 100; > module_param(delay_timeout, int, 0644); > MODULE_PARM_DESC(delay_timeout, "amount of time delay for each iteration."); > > -/* Number of iterations for cooling delay, 10 counts by default for now */ > -static unsigned int delay_cnt = 10; > +/* Number of iterations for cooling delay, 600 counts by default for now */ > +static unsigned int delay_cnt = 600; > module_param(delay_cnt, int, 0644); > MODULE_PARM_DESC(delay_cnt, "total number of iterations for time delay."); > > @@ -193,10 +193,11 @@ static int pch_wpt_get_temp(struct pch_thermal_device *ptd, int *temp) > return 0; > } > > +/* Cool the PCH when it's overheat in .suspend_noirq phase */ > static int pch_wpt_suspend(struct pch_thermal_device *ptd) > { > u8 tsel; > - u8 pch_delay_cnt = 1; > + int pch_delay_cnt = 1; > u16 pch_thr_temp, pch_cur_temp; > > /* Shutdown the thermal sensor if it is not enabled by BIOS */ > @@ -233,7 +234,10 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd) > * which helps to indentify the reason why S0ix entry was rejected. > */ > while (pch_delay_cnt <= delay_cnt) { > - if (pch_cur_temp <= pch_thr_temp) > + if (pch_cur_temp < pch_thr_temp) > + break; > + > + if (pm_wakeup_pending()) > break; > > dev_warn(&ptd->pdev->dev, > @@ -245,7 +249,7 @@ static int pch_wpt_suspend(struct pch_thermal_device *ptd) > pch_delay_cnt++; > } > > - if (pch_cur_temp > pch_thr_temp) > + if (pch_cur_temp >= pch_thr_temp) > dev_warn(&ptd->pdev->dev, > "CPU-PCH is hot [%dC] even after delay, continue to suspend. S0ix might fail\n", > pch_cur_temp); > @@ -455,7 +459,7 @@ static void intel_pch_thermal_remove(struct pci_dev *pdev) > pci_disable_device(pdev); > } > > -static int intel_pch_thermal_suspend(struct device *device) > +static int intel_pch_thermal_suspend_noirq(struct device *device) > { > struct pch_thermal_device *ptd = dev_get_drvdata(device); > > @@ -495,7 +499,7 @@ static const struct pci_device_id intel_pch_thermal_id[] = { > MODULE_DEVICE_TABLE(pci, intel_pch_thermal_id); > > static const struct dev_pm_ops intel_pch_pm_ops = { > - .suspend = intel_pch_thermal_suspend, > + .suspend_noirq = intel_pch_thermal_suspend_noirq, IMO it would be better to put this change into a separate patch and reorder the other changes after this one. It is valid by itself. > .resume = intel_pch_thermal_resume, > }; > > --