On Thu, Jan 04, 2024 at 05:32:34PM +0100, Paolo Bonzini wrote: > On 1/4/24 17:06, Paul E. McKenney wrote: > > Although I am happy to have been able to locate the commit (and even > > happier that Sean spotted the problem and that you quickly pushed the > > fix to mainline!), chasing this consumed a lot of time and systems over > > an embarrassingly large number of months. As in I first spotted this > > bug in late July. Despite a number of increasingly complex attempts, > > bisection became feasible only after the buggy commit was backported to > > our internal v5.19 code base. 🙁 > > Yes, this strikes two sore points. > > One is that I have also experienced being able to bisect only with a > somewhat more linear history (namely the CentOS Stream 9 aka c9s > frankenkernel [1]) and not with upstream. Even if the c9s kernel is not a > fully linear set of commits, there's some benefit from merge commits that > consist of slightly more curated set of patches, where each merge commit > includes both new features and bugfixes. Unfortunately, whether you'll be > able to do this with the c9s kernel depends a lot on the subsystems involved > and on the bug. Both are factors that may or may not be known in advance. I guess I am glad that it is not just me. ;-) > The other, of course, is testing. The KVM selftests infrastructure is meant > for this kind of white box problem, but the space of tests that can be > written is so large, that there's always too few tests. It shines when you > have a clear bisection but an unclear fix (in the past I have had cases > where spending two days to write a test led me to writing a fix in thirty > minutes), but boosting the reproducibility is always a good thing. Agreed, validation never will be perfect, and so improving the test suite based on production experience is a good thing, as is creating test cases based on the behavior of important production workloads for those who run them. > > And please understand that I am not casting shade on those who wrote, > > reviewed, and committed that buggy commit. As in I freely confess that > > I had to stare at Sean's fix for a few minutes before I figured out what > > was going on. > > Oh don't worry about that---rather, I am going to cast a shade on those that > did not review the commit, namely me. I am somewhat obsessed with Boolean > logic and *probably* I would have caught it, or would have asked to split > the use of designated initializers to a separate patch. Any of the two > could, at least potentially, have saved you quite some time. We have all done similar things. I certainly have! > > Instead, the point I am trying to make is that carefully > > constructed tests can serve as tireless and accurate code reviewers. > > This won't ever replace actual code review, but my experience indicates > > that it will help find more bugs more quickly and more easily. > > TBH this (conflict between virtual addresses on the host and the guest > leading to corruption of the guest) is probably not the kind of adversarial > test that one would have written or suggested right off the bat. But it > should be written now indeed. Very good, looking forward to seeing it! Thanx, Paul > Paolo > > [1] > https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/ >