On 8/9/22 15:34, Zev Weiss wrote:
On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
On 2022-08-09 22:56, Zev Weiss wrote:
On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
Hi,
[1.] One line summary of the problem:
NCT6775: suspend doesn't work after updating to Linux 5.19
[2.] Full description of the problem/report:
After updating my kernel from 5.18.11 to 5.19, I've noticed that resuming after suspend no longer works: fans start up, then about a second later, the computer just shuts down, leaving the USB ports powered up (normally it turns them off on shutdown). The screens don't turn on during this timeframe, so I can't see any useful log messages.
Bisecting between 5.18 (where it still worked) and 5.19 lead me to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) Split core and platform driver" which looks like a refactor commit, but apparently it broke something.
Hi Zoltán,
Thanks for the thorough bug report. You're right that that commit was essentially just a refactor, though there was one slight change to the nct6775_suspend() function introduced during the review process that may perhaps have had some subtle unintended side-effects.
Can you test the following patch and report if it resolves the problem?
Thanks,
Zev
Hi Zev,
Thanks for the quick reply. Yes, it looks like your patch does solve the problem (I've applied it on top of 5.19 (after fighting with my mail client for a while) and suspended a few times, it's working so far).
Regards,
Zoltan
Great, thanks.
Guenter, it looks like nct6775_suspend() really does in fact need to use nct6775_update_device() instead of dev_get_drvdata(), though it's not immediately obvious to me why. Though given that the bulk of of the body of nct6775_update_device() is inside an 'if' block that might not necessarily execute every time, I also wonder if it might be vulnerable to exhibiting the same problem depending on timing.
It isn't obvious to me either except ... is it possible that nct6775_update_device()
was never called (ie that the 'sensors' command was never executed) ?
That might just possibly explain the problem because in that case the values
restored in the resume() function were actually never read from the chip.
Guenter
Zoltán, if you could try another experiment to try to gather some data on that -- with the patch from my previous email still applied, could you try suspending via:
$ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input && echo mem > /sys/power/state
and see if suspend is broken again? The idea being to read from the hwmon device's sensors and then immediately suspend so that the nct6775_update_device() call from nct6775_suspend() falls within the driver's 1.5 second cache window and hence skips most of the work that function does. If the same problem starts occurring again that way, it seems we've had essentially the same bug lurking for a while and the change in the patch you bisected to just made it happen consistently instead of unpredictably. If suspend/resume continues working however, then I'm quite mystified, because the only other thing that's happening in that function is a mutex lock/unlock.
Thanks,
Zev