Re: scsi: BUG in scsi_init_io

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 19 Feb 2017 09:38:26 -0800

On Tue, Jan 31, 2017 at 7:41 AM, James Bottomley
<jejb@xxxxxxxxxxxxxxxxxx> wrote:
>
> It is a kernel bug and it should not be user triggerable, so it should
> have a warn_on or bug_on.

Hell NO.

Christ, James, listen to yourself. What you are basically saying when
you say it should be a BUG_ON() is

 "This shouldn't happen, but if it ever does happen, let's just turn
our mistaken assumptions into a dead machine that is really hard to
debug".

Because a BUG_ON() effectively kills the machine if the call chain has
some locks held. In the SCSI layer, that generally means that there
will be no logged oops either, because any locks held likely just
killed your filesystem or disk subsystem, so the oops is unlikely to
ever be reported by most normal users.

So stop this "should have a bug_on". In fact, since this apparently
_is_ easily user-triggerable, it damn well shouldn't have a warn_on
either. At most, a WARN_ON_ONCE(), so that we might get reports of
_what_ the bad call chain is, but we will never kill the machine and
we will *not* give people the ability to randomly spam the system
logs.
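
To make the "report it once, never spam the logs" point concrete, here
is a minimal userspace sketch. It is not the kernel's WARN_ON_ONCE()
implementation; sketch_warn_once(), sketch_check_request() and the
warnings_logged counter are hypothetical names made up for
illustration, and the "no segments" condition is only an assumed
example of a bad request.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts messages actually written, so the effect is easy to observe. */
static int warnings_logged;

/*
 * Hypothetical helper: report a bad condition the first time it is seen
 * through a given flag, stay silent afterwards, and always return the
 * condition so the caller can still act on it.
 */
static bool sketch_warn_once(bool cond, bool *warned, const char *what)
{
	if (cond && !*warned) {
		*warned = true;
		warnings_logged++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Hypothetical check that a user can trip as often as they like. */
static int sketch_check_request(unsigned int nr_segments)
{
	static bool warned;	/* one flag per call site */

	if (sketch_warn_once(nr_segments == 0, &warned, "request has no segments"))
		return -1;	/* reject the request; nothing is killed */
	return 0;
}
```

Calling sketch_check_request(0) in a loop keeps returning an error but
writes the message only on the first call, which is the behaviour
being asked for here: one report of the bad call chain, no dead
machine, no flooded log.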

BUG_ON() needs to die. People need to realize that it is a _problem_,
and that it makes any bugs _worse_. Don't do it.

The only valid reason for BUG_ON() is when some very core data
structure is _so_ corrupt that you can't even continue, because you
simply can't even return an error and there's no way for you to just
say "log it once and continue".

And by that I don't mean some random value you have in a request. I
mean literally "this is a really core data structure, and I simply
_cannot_ continue" (where that "cannot" is about an actual physical
impossibility, not an "I could continue but I think this is serious").

Anything else is a "return error, possibly with a WARN_ON() to let
people know that bad things are going on".
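
As a rough illustration of "return error" with a lock held, here is a
small userspace sketch. Everything in it is hypothetical: struct
sketch_request, sketch_submit() and the queue_lock_held flag (a plain
bool standing in for a real lock) are invented for the example and do
not correspond to actual SCSI or block layer code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a lock; true while it is "held". */
static bool queue_lock_held;

/* Hypothetical request type, for illustration only. */
struct sketch_request {
	unsigned int nr_segments;
};

static int sketch_submit(const struct sketch_request *rq)
{
	int ret = 0;

	queue_lock_held = true;		/* take the lock */

	if (rq->nr_segments == 0) {
		/*
		 * A BUG_ON() here would stop with the lock still held.
		 * Instead, report the problem and fail only this request.
		 */
		fprintf(stderr, "WARNING: request with no segments\n");
		ret = -EINVAL;
		goto out_unlock;
	}

	/* ... normal processing would go here ... */

out_unlock:
	queue_lock_held = false;	/* released on every path, including errors */
	return ret;
}
```

A caller that passes a request with nr_segments == 0 gets -EINVAL
back and the lock is released, so the rest of the system keeps
running and the warning can actually reach the log.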

Basically, BUG_ON() should be in core kernel code. Not in drivers. And
even in core kernel code, it's likely wrong.

             Linus