Re: Try/catch for modules?

"Valdis Klētnieks" <valdis.kletnieks@xxxxxx> · Fri, 18 Oct 2019 17:53:12 -0400

On Fri, 18 Oct 2019 13:11:54 -0300, Martin Galvan said:

> I don't think I was clear. My intent is that if a pointer bug isn't
> fixed, my module will fail gracefully and go through the catch block
> instead of panicking the whole system. 

Well... here's the thing.  Unless you have "panic_on_oops" set, hitting a null
pointer will usually *NOT* panic the whole system. In fact, the "[#1]" in the
oops message is a counter of how many times the kernel has oopsed already.
Way back in the dark mists of time, I had a system that managed to get it up to
#1500 or so overnight.
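
If you want to check how a given box is configured, a minimal userspace sketch
along these lines should do. The helper name read_int_setting() is made up for
illustration; the only real thing here is the procfs path
/proc/sys/kernel/panic_on_oops, which normally holds 0 or 1.

```c
#include <stdio.h>

/*
 * Hypothetical helper: return the integer stored in a procfs/sysctl-style
 * file, or -1 if the file cannot be opened or does not start with a number.
 */
int read_int_setting(const char *path)
{
    FILE *f = fopen(path, "r");
    int value;

    if (f == NULL)
        return -1;

    /* The file holds a single decimal integer followed by a newline. */
    if (fscanf(f, "%d", &value) != 1)
        value = -1;

    fclose(f);
    return value;
}
```

Calling read_int_setting("/proc/sys/kernel/panic_on_oops") on a typical Linux
system should give 0 (oops and carry on) or 1 (panic on the first oops).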

The problem is that at that point, a generic "fail gracefully" isn't really an option.

The most graceful generic thing the kernel can do at that point is kill the execution
thread that hit the error.  This can quickly go sideways if that thread held a lock
or similar critical resource.  And no, even though the kernel knows all the locks
the thread had, it *does not* know which ones, if any, are safe to unlock. The
answer is probably "none of them": locks are usually held around the smallest
amount of code possible, so if the lock was held, the data it protects is probably
half-updated and it's unsafe to break it. The few places in the kernel that do
lock-breaking are things like the printk/sysrq code, which amounts to "we're dead
anyhow, try to get some logging info".
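
A rough userspace analogy, using pthreads rather than real kernel locking: a
thread that dies while holding a plain mutex leaves that mutex locked, and
everybody who comes along later gets stuck behind it. The names fs_lock,
doomed_thread() and lock_state_after_thread_death() are invented for this
sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical lock protecting some shared filesystem state. */
static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for a thread that hits a fatal error inside a critical section:
 * it takes the lock and then terminates without ever releasing it.
 */
static void *doomed_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&fs_lock);
    /* ... shared state would be half-updated at this point ... */
    return NULL; /* "killed": the unlock never happens */
}

/*
 * Run the doomed thread to completion, then report what a later thread sees
 * when it tries to take the same lock without blocking.
 */
int lock_state_after_thread_death(void)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, doomed_thread, NULL);

    assert(rc == 0);
    pthread_join(t, NULL);

    /* A normal (non-robust) mutex stays locked after its owner is gone. */
    return pthread_mutex_trylock(&fs_lock);
}
```

With a default glibc mutex the trylock comes back with EBUSY. A caller using
pthread_mutex_lock() instead would simply block forever, which is the
userspace cousin of all those processes piling up in 'D' state.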

I've seen systems that manage to get the load average up to 17,000 or so, because
process after process got into 'D' state because they tried to do filesystem I/O to
a filesystem that had a lock wedged when a process oopsed. (A good reason for
production systems to have lots of filesystems - if a kernel bug causes the /apps/database/logs
filesystem to hang, you can probably reboot and recover because /apps/database/replay
is synced to disk, and you have lots of stuff in /var that's got forensic info in it.  Use one
big filesystem, and when it locks up, you're immediately dead in the water.  Might not
even be able to ssh in, because that hangs because it writes to /var which is wedged
along with everything else....)

And if you actually *think* about it - a 'try/catch' is semantically *identical* to
coding a parameter test before the event or checking a return code after.
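
To make that concrete, here is a small sketch in plain C of both halves: the
parameter test before the event, and the return code check after it, which
plays the role of the catch block. struct record, read_value() and
read_value_or_default() are hypothetical names, not anything from a real
module.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical data structure used only for this illustration. */
struct record {
    int value;
};

/*
 * Parameter test before the event: refuse a bad pointer up front instead of
 * dereferencing it. Returns 0 on success or a negative errno value.
 */
int read_value(const struct record *rec, int *out)
{
    if (rec == NULL || out == NULL)
        return -EINVAL;

    *out = rec->value;
    return 0;
}

/*
 * Return code check after the event: the error branch below does the job a
 * catch block would, here by falling back to a caller-supplied default.
 */
int read_value_or_default(const struct record *rec, int fallback)
{
    int value;
    int err = read_value(rec, &value);

    if (err)
        return fallback;

    return value;
}
```

Note that this only catches the NULL case. A pointer that is non-NULL but
garbage sails straight through the test, so the check is only as good as your
ability to tell a bad parameter from a good one.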

(A good place to interject Tom Duff's First Law of Systems Programming - "Never test
for an error condition you don't know how to handle".  Note in this context, "kill the thread
and pray" means you *do* know how to handle it - by killing the thread...)

Also - say you have a try/catch around a statement.  For some exceptions, such
as an end-of-file or a dropped network connection, it's reasonably easy to know
how to clean up and continue. But what if the statement hits a null pointer
error?  What do you do to clean things up?  You have a bad pointer, and you
have *no way to actually fix it and continue normally*.

And don't get me started on try/catch/throw - that's got even *more* land mines. :)

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@xxxxxxxxxxxxxxxxx
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies