Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

Nishanth Menon <nm@xxxxxx> · Thu, 31 Dec 2015 11:47:57 -0600

On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> can we have a shorter title?
> 
> On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
>> Hi Nishanth,
>>
> 
> <cut> 
>>>
>>> I am not sure if this #ifdeffery is even needed.
>>>
>>>
>>> Eduardo, Rui: If this is not the suggested technique, maybe you guys
>>> could suggest how we could handle a case where userspace might be
>>> hungup due to some reason and a case where a critical temperature
>>> event in the middle of device probe was triggered?
> 
> Orderly power off is supposed to take care of this. Looking at the code,
> it will force a shutdown in case execution of userland command fails:
> 
> static int __orderly_poweroff(bool force)
> {
>         int ret;
> 
>         ret = run_cmd(poweroff_cmd);
> 
>         if (ret && force) {
>                 pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> 
>                 /*
>                  * I guess this should try to kick off some daemon to sync and
>                  * poweroff asap.  Or not even bother syncing if we're doing an
>                  * emergency shutdown?
>                  */
>                 emergency_sync();
>                 kernel_power_off();
>         }

Yes, it will *IF* running the userspace command fails. The condition I
had tracked down predates the following fix [1]. An example failure is
here [2].

In this case, tmp102 is set up for X15 as in [3] and built as a module.
As the kernel brings up the filesystem and udev rules start a modprobe
of all modules, the probe of tmp102 detects (falsely) a critical
temperature condition. A shutdown attempt in the middle of driver probe
is always tricky business.

Looking at the log in [2], line 472:
> thermal thermal_zone3: critical temperature reached(108 C),shutting down
This is where the userspace shutdown gets triggered.

Line 495: INIT: Sending processes the TERM signal

Userspace starts shutting down services (but note that probes for
other devices were either in progress or queued up to complete).

At line 647 we are in a weird place: sysrq shows that the system is
idle and userspace is shut down, yet the system is still up.

Here we got into this state thanks to a driver bug, but had this been
a real-world overtemperature event, device damage could take place OR
something much worse.

The only alternative is to run a parallel thread that takes over if
userspace fails to complete the job within some given period of time,
whatever the condition triggering the problem.
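
To make the idea concrete, here is a rough userspace model of the
backup deadline. It is only a sketch, not the code from the RFC patch:
every name in it (backup_shutdown, backup_shutdown_arm,
backup_shutdown_check) is made up for illustration, time is passed in
as plain milliseconds, and counter wraparound is ignored.

```c
#include <stdbool.h>

/* Hypothetical names throughout: this models the idea, not the RFC patch. */
enum backup_action {
	BACKUP_IDLE,		/* no shutdown was requested */
	BACKUP_WAIT,		/* userspace still has time to finish */
	BACKUP_FORCE_POWEROFF	/* deadline passed: force the power off */
};

struct backup_shutdown {
	bool armed;
	unsigned long deadline_ms;
};

/* Called when the critical trip is hit, just before asking userspace to power off. */
void backup_shutdown_arm(struct backup_shutdown *b,
			 unsigned long now_ms, unsigned long delay_ms)
{
	/* Only the first critical event sets the deadline; repeats must not extend it. */
	if (b->armed)
		return;
	b->armed = true;
	b->deadline_ms = now_ms + delay_ms;	/* wraparound ignored for brevity */
}

/* Called by the parallel thread to decide whether it has to step in. */
enum backup_action backup_shutdown_check(const struct backup_shutdown *b,
					 unsigned long now_ms)
{
	if (!b->armed)
		return BACKUP_IDLE;
	if (now_ms < b->deadline_ms)
		return BACKUP_WAIT;
	return BACKUP_FORCE_POWEROFF;
}
```

In the kernel, the BACKUP_FORCE_POWEROFF branch would presumably do
what the forced path quoted above does, i.e. emergency_sync() followed
by kernel_power_off(), regardless of how far userspace got.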

I hope this explains the problem.

[1]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
[2] http://pastebin.ubuntu.com/14326688/

[3]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738

-- 
Regards,
Nishanth Menon