> > > > Can the KVM maintainers take a look at this? This doesn't have anything to do > > > > with my commit that syzbot bisected it to. > > > > > > > > +Dmitry, statistics lession: if a crash occurs only 1 in 10 times, as was the > > > > case here, then often it will happen 0 in 10 times by chance. syzbot needs to > > > > run the reproducer more times if it isn't working reliably. Otherwise it ends > > > > up blaming some random commit. > > > > > > Added a note to https://github.com/google/syzkaller/issues/1051 > > > Thanks > > > > As we increase number of instances, we increase chances of hitting > > unrelated bugs. E.g. take a look at the bisection log for: > > https://syzkaller.appspot.com/bug?extid=f14868630901fc6151d3 > > What is the optimum number of tests is a good question. I suspect that > > the current 10 instances is close to optimum. If we use significantly > > more we may break every other bisection on unrelated bugs... > > > > Only because syzbot is being super dumb in how it does the bisection. AFAICS, > in the example you linked to, buggy kernels reliably crashed 10 out of 10 times > with the original crash signature, "WARNING in cgroup_exit". Then at some point > it tested some kernel without the bug and got a different crash just 1 in 10 > times, "WARNING: ODEBUG bug in netdev_freemem". > > The facts that the crash frequency was very different, and the crash signature > was different, should be taken as a very strong signal that it's not the bug > being bisected for. And this is something easily checked for in code. > > BTW, I hope you're treating fixing this as a high priority, given that syzbot is > now sending bug reports to kernel developers literally selected at random. This > is a great way to teach people to ignore syzbot reports. (When I suggested > bisection originally, I had assumed you'd implement some basic sanity checks so > that only bisection results likely to be reliable would be mailed out.) While I believe we can get some quality improvement by shuffling numbers. I don't think we can get significant improvement overall and definitely not eliminate wrong bisection results entirely. It's easy to take a single wrong bisection and design a system around this scenario, but it's very hard to design a system that will handle all of them in all generality. For example, look at these bisection logs for cases where reproduction frequency varies from 1 to all, but that's still the same bug: https://syzkaller.appspot.com/x/bisect.txt?x=12df1ba3200000 https://syzkaller.appspot.com/x/bisect.txt?x=10daff1b200000 https://syzkaller.appspot.com/x/bisect.txt?x=1592b037200000 https://syzkaller.appspot.com/x/bisect.txt?x=11c610a7200000 https://syzkaller.appspot.com/x/bisect.txt?x=17affd1b200000 You also refer to "a different crash". But that's not a predicate we can have. And definitely not something that is "easily checked for in code". Consider, a function rename anywhere in the range will lead to as if a different crash. If you look at all bisection logs you find lots of amusing cases where something that a program may consider a different bugs is actually the same bug, or the other way around. So if we increase number of tests and we don't have a way to distinguish crashes (which we don't), we will necessary increase incorrect results due to unrelated bugs. Bisection is a subtle process and the predicate, whatever logic it does internally, in the end need to produce a single yes/no. And a single wrong answer in the chain leads to a completely incorrect result. There are some fundamental reasons for wrong results: - hard to reproduce bugs (not fixable) - unrelated bugs/broken builds (fixable) While tuning numbers can pepper over these to some degree (maybe), these reasons will stay and will lead to incorrect results. Also I don't this tuning as something that is trivially to do as you suggest. For example, how exactly do you assess a crash as reliably happening vs episodically? How exactly do you choose number of tests for each case? Choosing too few tests will lead to incorrect results, choosing too many will lead to incorrect results. How exactly do you assess that something that was happening reliably now does not happen reliably? How do you assess that a crash is very different? Each of the choices have chances of producing more bad results, so one would need to rerun hundreds of bisections with old/new version, and then manually mark results and then estimate quality change (which most likely will be flaky or inconclusive in lots of cases). Tuning quality of heuristics-based algorithms is very time consuming, especially if each experiment takes weeks. There is another down-side for not "super dumb" algorithms. Which is explaining results. Consider that syzbot now mails a bisection where the crash happened and a developer sees that it's the same crash, but syzbot says "nope. did not crash". That will cause reasonable questions and somebody (who would that be?) will need to come and explain what happens and why, and how that counter-intuitive local result was shown to improve quality overall. Simpler algorithms are much easier to explain. I consider bisection as high priority, but unfortunately only among other high priority and very high priority work. Besides work on the fuzzer itself and bug detection tools, we now test 15 kernels across 6 different OSes. Operational work can't be deprioritized because then nothing will work at all. Change reviews can't be deprioritized. Overseeing bug flow can't be deprioritized. Updating crash parsing in response to new kernel output can't be deprioritized. Answering all human emails can't be deprioritized.