Re: [PATCH] Documentation: coding-style: don't encourage WARN*()

Laurent Pinchart <laurent.pinchart@xxxxxxxxxxxxxxxx> · Mon, 15 Apr 2024 11:25:29 +0300

Hi Greg,

On Mon, Apr 15, 2024 at 07:21:37AM +0200, Greg KH wrote:
> On Sun, Apr 14, 2024 at 10:48:35PM +0300, Laurent Pinchart wrote:
> > On Sun, Apr 14, 2024 at 12:08:50PM -0500, Alex Elder wrote:
> > > Several times recently Greg KH has admonished that variants of WARN()
> > > should not be used, because when the panic_on_warn kernel option is set,
> > > their use can lead to a panic. His reasoning was that the majority of
> > > Linux instances (including Android and cloud systems) run with this option
> > > enabled. And therefore a condition leading to a warning will frequently
> > > cause an undesirable panic.
> > > 
> > > The "coding-style.rst" document says not to worry about this kernel
> > > option.  Update it to provide a more nuanced explanation.
> > > 
> > > Signed-off-by: Alex Elder <elder@xxxxxxxxxx>
> > > ---
> > >  Documentation/process/coding-style.rst | 21 +++++++++++----------
> > >  1 file changed, 11 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/Documentation/process/coding-style.rst b/Documentation/process/coding-style.rst
> > > index 9c7cf73473943..bce43b01721cb 100644
> > > --- a/Documentation/process/coding-style.rst
> > > +++ b/Documentation/process/coding-style.rst
> > > @@ -1235,17 +1235,18 @@ example. Again: WARN*() must not be used for a condition that is expected
> > >  to trigger easily, for example, by user space actions. pr_warn_once() is a
> > >  possible alternative, if you need to notify the user of a problem.
> > >  
> > > -Do not worry about panic_on_warn users
> > > -**************************************
> > > +The panic_on_warn kernel option
> > > +********************************
> > >  
> > > -A few more words about panic_on_warn: Remember that ``panic_on_warn`` is an
> > > -available kernel option, and that many users set this option. This is why
> > > -there is a "Do not WARN lightly" writeup, above. However, the existence of
> > > -panic_on_warn users is not a valid reason to avoid the judicious use
> > > -WARN*(). That is because, whoever enables panic_on_warn has explicitly
> > > -asked the kernel to crash if a WARN*() fires, and such users must be
> > > -prepared to deal with the consequences of a system that is somewhat more
> > > -likely to crash.
> > > +Note that ``panic_on_warn`` is an available kernel option. If it is enabled,
> > > +a WARN*() call whose condition holds leads to a kernel panic.  Many users
> > > +(including Android and many cloud providers) set this option, and this is
> > > +why there is a "Do not WARN lightly" writeup, above.
> > > +
> > > +The existence of this option is not a valid reason to avoid the judicious
> > > +use of warnings. There are other options: ``dev_warn*()`` and ``pr_warn*()``
> > > +issue warnings but do **not** cause the kernel to crash. Use these if you
> > > +want to prevent such panics.
> > 
> > Those options are not equivalent, they print a single message, which is
> > much easier to ignore. WARN() is similar to -Werror in some sense, it
> > pushes vendors to fix the warnings. I have used WARN() in the past to
> > indicate usage of long-deprecated APIs that we were getting close to
> > removing for instance. dev_warn() wouldn't have had the same effect.
> 
> If you want to reboot a box because someone called an "improper" api,

I don't "want" to reboot. It came as a side effect when panic_on_warn
was added, and worsened when its adoption increased. I won't argued for
or against panic_on_warn, but WARN() serves some use cases today that I
consider valid. If we want to discourage its usage, we need another API
to cover those use cases.

> then sure, use WARN(), but that feels like a really bad idea.  Just
> remove the api and fix up all in-kernel users instead.  Why wait?

There are multiple use cases. One of them is to make sure no new user of
the old, deprecated behaviour is introduced. This is especially
important when driver development spans multiple kernel releases, the
development can start before the API behaviour changes, with the driver
merged after the API change. This is something we've done multiple times
in V4L2.

> If you want to show a traceback, then just print that out, but I've seen
> that totally ignored as well, removing the api is usually the only way
> to get people to actually notice, as then their builds break.

Does your experience tell that tracebacks are routinely ignored during
development too, not just in production ?

-- 
Regards,

Laurent Pinchart