On Fri, 18 Oct 2019 13:11:54 -0300, Martin Galvan said:

> I don't think I was clear. My intent is that if a pointer bug isn't
> fixed, my module will fail gracefully and go through the catch block
> instead of panicking the whole system.

Well... here's the thing. Unless you have "panic_on_oops" set, hitting a null pointer will usually *NOT* panic the whole system. In fact, the "[#1]" in the oops message is a counter of how many times the kernel has oopsed already. Way back in the dark mists of time, I had a system that managed to get it up to #1500 or so overnight.

The problem is that at that point, a generic "fail gracefully" isn't really an option. The most graceful generic thing the kernel can do at that point is kill the execution thread that hit the error. This can quickly go sideways if that thread held a lock or similar critical resource.

And no, even though the kernel knows all the locks the thread had, it *does not* know which ones, if any, are safe to unlock. The answer is probably "none of them", because locks are usually held around the smallest amount of code possible, so if the lock was held, it's probably unsafe to break it. The few places in the kernel that do lock-breaking are basically all things like the printk/sysrq code, which amounts to "we're dead anyhow, try to get some logging info".

I've seen systems that manage to get the load average up to 17,000 or so, because process after process got into 'D' state when they tried to do I/O to a filesystem that had a lock wedged when a process oopsed.

(A good reason for production systems to have lots of filesystems - if a kernel bug causes the /apps/database/logs filesystem to hang, you can probably reboot and recover because /apps/database/replay is synced to disk, and you have lots of stuff in /var that's got forensic info in it. Use one big filesystem, and when it locks up, you're immediately dead in the water. Might not even be able to ssh in, because that hangs because it writes to /var which is wedged along with everything else....)

And if you actually *think* about it - a 'try/catch' is semantically *identical* to coding a parameter test before the event or checking a return code after.

(A good place to interject Tom Duff's First Law of Systems Programming - "Never test for an error condition you don't know how to handle". Note that in this context, "kill the thread and pray" means you *do* know how to handle it - by killing the thread...)

Also - say you have a try/catch around a statement. For some exceptions, such as an end-of-file or a dropped network connection, it's reasonably easy to know how to clean up and continue. But what if the statement hits a null pointer error? What do you do to clean things up? You have a bad pointer, and you have *no way to actually fix it and continue normally*.

And don't get me started on try/catch/throw - that's got even *more* land mines. :)
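
To make the try/catch comparison concrete, here's a minimal userspace C sketch of the "parameter test before, return code after" pattern. Everything in it (struct record, record_set_name(), rename_or_default(), the 16-byte name field) is a made-up name for illustration, not anything from the kernel or from Martin's module. It just borrows the kernel's habit of returning negative errno values.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type, for illustration only. */
struct record {
    char name[16];
};

/*
 * "Test before the event": reject bad input up front and report it with
 * an error code, instead of dereferencing it and hoping something
 * catches the fallout.
 */
int record_set_name(struct record *rec, const char *name)
{
    if (rec == NULL || name == NULL)
        return -EINVAL;
    if (strlen(name) >= sizeof(rec->name))
        return -ENAMETOOLONG;
    strcpy(rec->name, name);    /* safe: length was checked above */
    return 0;
}

/*
 * "Check the return code after": the caller plays the role of the catch
 * block, and only handles the errors it knows how to handle.
 */
int rename_or_default(struct record *rec, const char *name)
{
    int err = record_set_name(rec, name);

    if (err == -ENAMETOOLONG)
        /* Recoverable: fall back to a name that is known to fit. */
        return record_set_name(rec, "unnamed");

    /* 0 on success. -EINVAL is passed up, since nothing here can fix it. */
    return err;
}
```

The too-long name is the "dropped network connection" sort of error: the caller knows what to do and carries on. The NULL case is the other sort. All the caller can sensibly do is refuse and hand the error upward, which is about what a catch block could do with it too.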
_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@xxxxxxxxxxxxxxxxx
https://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies