On Fri, Jan 31, 2020 at 11:17 AM Dan Carpenter <dan.carpenter@xxxxxxxxxx> wrote: > > On Fri, Jan 31, 2020 at 05:05:10PM +0800, Hillf Danton wrote: > > > > On Fri, 31 Jan 2020 08:06:52 +0300 Dan Carpenter wrote: > > > We changed this from dev_err() to dev_WARN() in commit 0cb54a3e47cb > > > ("USB: debugging code shouldn't alter control flow"). > > > > > > The difference between dev_WARN() and dev_err() is that dev_WARN() > > > prints a stack trace and if you have panic on OOPS enabled then it leads > > > to a panic. The dev_err() function just prints the error message. > > > > > > Back in the day we didn't have usb emulators fuzz testing the kernel > > > so dev_WARN() didn't cause a problem for anyone, but these days the > > > > Another free option is perhaps to keep the devoted bot agile if it's > > difficult to list anybody who was mauled by its articulate reports. > > It's difficult to parse this email. I get that you're being sarcastic > but I can't tell what you're being sarcastic about. :P > > I think you're basically saying that syzbot should maintain a white > list of ignored Oopses. There are two problems with this: 1) Other > people run syzbot so everyone has to run into this bug and then add it > to their own white list. 2) If the kernel OOpes here then we cannot > test what happens next so it could be hiding bugs. > > One idea is that there could be a kernel function which generates a > stack trace but is not an Oops. Yes, this is needed for any kernel testing: not just syzbot, and not just syzkaller, and not just fuzzing, literally for any kernel testing. We need a way to easily distinguish between kernel bugs and not bugs in a black-and-white manner. Otherwise, of practical options testing systems can either ignore lots of kernel bugs during testing (unfortunately I see this happening, if it does not panic the system, it's being ignored); or attach a human expert in each system to read logs of each test run to sort kernel messages into bugs and non-bugs. Both of these are not good options. This is not about stack traces. There is already a function that does this (print_stack() or something), and it's fine to print stacks (if necessary, it produces lots of output so should not be taken lightly). The way syzkaller detects these now is by "WARNING:" prefix. This is also not about the exact way these are spelled out. We could make "WARNING:" mean an invalid user input and use something else for kernel bugs. But it just seems that WARNING===kernel bug is a really good starting point (lots of debugging tools use it). Especially taking into account the general recommendation of "don't panic kernel if it can proceed", as the result lots of BUGs (in the sense that these are kernel bugs) were turned into WARNINGs (or written out as WARNINGs initially). Otherwise BUG would be a good marking for bugs, except that it's not recommended to use in most cases. I see lots of people also mention panic_on_warn in the context of these reports. panic_on_warn here is only a red herring. It really does not change anything. We could remove it, but still report WARNINGs. But syzkaller also reports some things that don't panic anyway. This is really about the criteria for kernel bug vs non-bug (something that needs to be reported or not). I would assume this criteria is also important for kernel developers (people reading/writing code), especially for new subsystems. If I see a WARNING in code (or just any kind of assertion), it's useful to know whether it's something that can't happen under any circumstances and if it happens it's really a serious logical flow somewhere; versus just invalid input/rare/unexpected runtime condition. The first category can be used as a basis for building my understanding of the code. Not being able to understand type of assertion confuses you both ways: if you think it's something that can't happen but at the same time you see a way it can be violated, you question your understanding of the code. Or the other way around, you are trying to figure out the way how condition can become true, but you can only conclude that it's always false, you also question your understanding of the code. There were proposals to add a parallel set of macros, say, NOTICE (name is obviously discussable). That would print something different from WARNING: and optionally with a stack trace. Maybe also a parallel set of _once and _ratelimited versions (esp since these can be triggered). These would very unambiguously mean "this can be triggered by users/devices; on low-memory, etc". But there was not enough interest at the time and the discussion quickly died. Maybe it's time to revive it. You can imagine some value-added features on top: e.g. command line/sysctl option to disable that output entirely, or reduce it to 1 line (no stack), if there are too many of these (we know these can be triggered!), or if nobody is simply looking for them anyway (true for most users). +Eric, Steve, who got these WARNING-not-a-kernel-bug reports recently too