Re: Return value for "impossible" situations

"Valdis Klētnieks" <valdis.kletnieks@xxxxxx> · Sun, 25 Jul 2021 19:15:39 -0400

On Sun, 25 Jul 2021 13:07:21 -0500, Ian Pilcher said:

> Is there any sort of convention around what to return in the case of an
> error in the logic of the code itself, something that will make it as
> obvious as possible that the problem is a bug.

In general, there's no good way to signal such issues back to userspace,
because there's no reserved '-EHIT_A_BUG' value.  This has been true
for decades, ever since Unix was still on the 18-bit PDP-7 in 1969.

And it's basically useless to return such a value to userspace, because there's
nothing useful that userspace can *do* in such a case. Note that pretty much
all the defined error codes refer back to things that userspace could at least
potentially do something useful - it may retry the operation after a delay, or
tell the user that an optional facility isn't available in the currently running kernel,
or try an alternate method of doing an operation (for example, trying again
with IPv4 if an IPv6 connection fails, or use a different method of file locking).
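
As a rough illustration of that point, here is how a userspace program might
sort errno values into "things I can do something about".  The enum, the
function name and the particular grouping are made up for this sketch; real
programs pick whichever handful of codes matter to them.

```c
#include <errno.h>

/* Hypothetical categories of reaction; not taken from any real API. */
enum action {
	ACT_RETRY_LATER,      /* transient: wait a bit and try again */
	ACT_FEATURE_MISSING,  /* optional facility not in this kernel */
	ACT_TRY_ALTERNATE,    /* e.g. fall back from IPv6 to IPv4 */
	ACT_GIVE_UP           /* nothing specific to do: report and bail */
};

/* Map a (positive) userspace errno value to a reaction. */
enum action classify_errno(int err)
{
	switch (err) {
	case EAGAIN:
	case EINTR:
		return ACT_RETRY_LATER;
	case ENOSYS:
	case EOPNOTSUPP:
		return ACT_FEATURE_MISSING;
	case EAFNOSUPPORT:
	case ENETUNREACH:
		return ACT_TRY_ALTERNATE;
	default:
		return ACT_GIVE_UP;
	}
}
```

Note that there is no sensible bucket for a hypothetical "kernel bug" code
other than the default one.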

But if a userspace process hits an actual kernel bug, what is it supposed to do
to recover?  Do you add "check for -EHIT_A_BUG" to every single place you do a
syscall?  After all, 98% of userspace code is, *at best*, going to simply do an
'if (!error)' test.  And userspace code only does a more detailed check of *which*
errno it got handed if it can do something different/useful for a specific code (such
as code that goes into a retry loop if it gets -EEXIST when trying to create a
lock file that shouldn't exist, and should be removed by another process when
it's done).
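
A minimal sketch of that lock-file case, assuming the lock is taken with
open(2) and O_CREAT | O_EXCL.  The function name and its parameters are
invented for this example.  In userspace the code shows up as
errno == EEXIST (positive); the negative form is the in-kernel convention.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to create the lock file exclusively.  EEXIST is the one errno we
 * treat specially: somebody else holds the lock, so wait and retry.
 * Returns an open fd on success, -1 with errno set on failure.
 */
int acquire_lock(const char *path, int max_tries, unsigned int delay_sec)
{
	for (int i = 0; i < max_tries; i++) {
		int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);

		if (fd >= 0)
			return fd;
		if (errno != EEXIST)
			return -1;	/* any other error: just report it */
		if (i + 1 < max_tries)
			sleep(delay_sec);
	}
	errno = EEXIST;		/* still locked after all retries */
	return -1;
}
```

The caller releases the lock by closing the fd and unlinking the file.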

In particular, userspace has no ability to log any useful debugging
information. There's the additional issue that the actual problem may not even
be in that syscall's code - it could be some previous syscall from the current
process that mis-set something in a structure, or some other kernel thread
doing a write-after-free and corrupting memory, or code that assumed that it
wouldn't be rescheduled to another CPU, or a myriad of other ways to fail.

The *proper* thing to do, instead of deciding to return -EHIT_A_BUG, is to do a
WARN() or BUG(), so that dmesg has something that's at least potentially
useful.  Then use that information to fix the issue.
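
A rough sketch of what that can look like.  The state machine, the function
name and the choice of -EINVAL as the ordinary error to return afterwards are
all assumptions for illustration.  The WARN_ON() defined here is only a
userspace stand-in so the example compiles outside the kernel tree; it mimics
the real macro in evaluating to the condition, but the real one puts a
backtrace in dmesg rather than a line on stderr.

```c
#include <errno.h>
#include <stdio.h>

#ifndef __KERNEL__
/* Userspace stand-in for the kernel's WARN_ON(); see note above. */
static int warn_on_stub(int cond, const char *expr, const char *file, int line)
{
	if (cond)
		fprintf(stderr, "WARNING: %s:%d: %s\n", file, line, expr);
	return cond;
}
#define WARN_ON(cond) warn_on_stub(!!(cond), #cond, __FILE__, __LINE__)
#endif

/* Hypothetical driver states, purely for the example. */
enum demo_state { DEMO_IDLE, DEMO_RUNNING, DEMO_DONE };

int demo_next_state(int state)
{
	/* "Can't happen": leave a trace in the log, then fail normally. */
	if (WARN_ON(state < DEMO_IDLE || state > DEMO_DONE))
		return -EINVAL;

	return state == DEMO_DONE ? DEMO_DONE : state + 1;
}
```

Userspace just sees an ordinary failure; the evidence that it was a bug lands
in the log, where somebody can actually act on it.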