On Tue, Oct 4, 2016 at 12:03 AM, Raymond Jennings <shentino@xxxxxxxxx> wrote: > On Mon, Oct 3, 2016 at 9:12 PM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: >> >> >> Killing the machine is ok if we have a situation where there literally >> is no other choice. > > For the curious: > > This would include situations like > > 1. The kernel is confused and further processing would result in undefined > behavior (like bluesmoke detecting PCC for example) Yes. Mainly situations where you cannot even have sane error handling, so you can't just do "warn and return without doing anything". It might be some "I noticed that the CPU stack is corrupt, I can't even return, I will just have to terminate". > 2. Security hazards where we'd leak stuff if we don't shut down. Honestly, I can't think of a situation where that has actually happened. Now, sometimes BUG_ON() is less onerous than other time: if you know that you are in a regular process context where you're not holding core locks, BUG_ON() is actually fairly benign: it will print a big scary message, and then it will kill the current process. It's not going to kill the machine, unless the admin has explicitly asked for "reboot if you have issues", which is mainly a situation for the googles of the world - if you have millions of machines and you don't actually *care*, then rebooting is fine. So realistically, the main places you should use BUG_ON() variants is (a) development code where it replaces error handling that you just haven't written yet, and you haven't really thought through all the possibilities, so you're saying "this can't happen, I'll fix it later". It sounds like Andrew thought that that is what VM_BUG_ON() is, and that it wouldn't be enabled unless you're a developer. But no, this is a "RFC patch" kind of situation. This kind of BUG_ON() often ends up escaping into the wild, but it should be after *huge* amounts of testing, and by definition it should never have been accepted during anythign but the merge window. So in a very real sense it's really my bad for not reacting to the BUG_ON() being added during rc8. (b) very core code that actually verifies some very core assumptions that are *so* important that if they are broken the code is by definition not really able to function. This is the actual intended case. It's not a "let's check that everybody did things right", it's a "this is a major design rule in this core code". The example in workingset_node_shadows_dec() _could_ actually have been that kind of (b) situation, except for the timing and lack of deep testing. But a reasonable example of (b) would be something like the BUG_ON(!PageLocked(page)); kind of code in fs/buffer.c - it's core infrastructure that has been tested with core code, and the BUG_ON() is meant to catch bad _new_ users quickly. And it's *such* a core requirement that error handling doesn't even make sense. Again, workingset_node_shadows_dec() could have a BUG_ON() in theory. But the BUG_ON() is _wrong_ when we had a situation of "oh, we just recently noticed a bug in this area, so lets' just verify that it's really gone". Notice? Just the timing and intent can make the difference between "good BUG_ON() in solid code that has been around forever" and "bad BUG_ON() checking something that we know we might be getting wrong". Linus -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html