On Fri, 2021-04-09 at 10:08 +0300, Matti Vaittinen wrote: > On Thu, 2021-04-08 at 20:20 -0700, Kees Cook wrote: > > On Wed, Apr 07, 2021 at 03:50:15PM +0300, Andy Shevchenko wrote: > > > On Wed, Apr 7, 2021 at 12:49 PM Vaittinen, Matti > > > <Matti.Vaittinen@xxxxxxxxxxxxxxxxx> wrote: > > > > On Wed, 2021-04-07 at 12:10 +0300, Andy Shevchenko wrote: > > > > > On Wed, Apr 7, 2021 at 8:02 AM Matti Vaittinen > > > > > <matti.vaittinen@xxxxxxxxxxxxxxxxx> wrote: > > > > > > On Wed, 2021-04-07 at 01:44 +0300, Andy Shevchenko wrote: > > > > > > > On Tuesday, April 6, 2021, Matti Vaittinen < > > > > > > > matti.vaittinen@xxxxxxxxxxxxxxxxx> wrote: > > > > > > > > + BUG(); > > > > > > > > +} > > > > This, though, are you sure you want to use BUG()? Linus gets upset > > about > > such things: > > https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on > > > > I see. I am unsure of what would be the best action in the regulator > case we are handling here. To give the context, we assume here a > situation where power has gone out of regulation and the hardware is > probably failing. First countermeasure to protect what is left of HW > is > to shut-down the failing regulator. BUG() was called here as a last > resort if shutting the power via regulator interface was not > implemented or working. > > Eg, we try to take what ever last measure we can to minimize the HW > damage - and BUG() was used for this in the qcom driver where I stole > the idea. Judging the comment related to BUG() in asm-generic/bug.h > > /* > * Don't use BUG() or BUG_ON() unless there's really no way out; one > > * example might be detecting data structure corruption in the middle > * > of an operation that can't be backed out of. If the (sub)system > * can > somehow continue operating, perhaps with reduced functionality, > * it's > probably not BUG-worthy. > * > * If you're tempted to BUG(), think > again: is completely giving up > * really the *only* solution? There > are usually better options, where > * users don't need to reboot ASAP and > can mostly shut down cleanly. > */ > https://elixir.bootlin.com/linux/v5.12-rc6/source/include/asm-generic/bug.h#L55 > > this really might be valid use-case. > > To me the real question is what happens after the BUG() - and if > there > is any generic handling or if it is platform/board specific? Does it > actually have any chance to save the HW? > > Mark already pointed that we might need to figure a way to punt a > "failing event" to the user-space to initiate better "safety > shutdown". > Such event does not currently exist so I think the main use-case here > is to do logging and potentially prevent enabling any further actions > in the failing HW. > > So - any better suggestions? > Maybe we should take same approach as is taken in thermal_core? Quoting the thermal documentation: "On an event of critical trip temperature crossing. Thermal framework allows the system to shutdown gracefully by calling orderly_poweroff(). In the event of a failure of orderly_poweroff() to shut down the system we are in danger of keeping the system alive at undesirably high temperatures. To mitigate this high risk scenario we program a work queue to fire after a pre-determined number of seconds to start an emergency shutdown of the device using the kernel_power_off() function. In case kernel_power_off() fails then finally emergency_restart() is called in the worst case." Maybe this 'hardware protection, in-kernel, emergency HW saving shutdown' - logic, should be pulled out of thermal_core.c (or at least exported) for (other parts like) the regulators to use? I don't like the idea relying in the user-space to be in shape it can handle the situation. I may be mistaken but I think a quick action might be required. Hence the in-kernel handling does not sound so bad to me. I am open to all education and suggestions. Meanwhile I am planning to just convert the BUG() to WARN(). I don't claim I know how BUG() is implemented on each platform - but my understanding is that it does not guarantee any power to be cut but just halts the calling process(?). I guess this does not guarantee what happens next - maybe it even keeps the power enabled and end up just deadlocking the system by reserved locks? I think thermal guys have been pondering this scenario for severe temperature protection shutdown so I would like to hear your opinions. Best Regards Matti Vaittinen