On 1/4/24 17:06, Paul E. McKenney wrote:
Although I am happy to have been able to locate the commit (and even
happier that Sean spotted the problem and that you quickly pushed the
fix to mainline!), chasing this consumed a lot of time and systems over
an embarrassingly large number of months. As in I first spotted this
bug in late July. Despite a number of increasingly complex attempts,
bisection became feasible only after the buggy commit was backported to
our internal v5.19 code base. 🙁
Yes, this strikes two sore points.
One is that I have also experienced being able to bisect only with a
somewhat more linear history (namely the CentOS Stream 9 aka c9s
frankenkernel [1]) and not with upstream. Even if the c9s kernel is not
a fully linear set of commits, there's some benefit from merge commits
that consist of slightly more curated set of patches, where each merge
commit includes both new features and bugfixes. Unfortunately, whether
you'll be able to do this with the c9s kernel depends a lot on the
subsystems involved and on the bug. Both are factors that may or may
not be known in advance.
The other, of course, is testing. The KVM selftests infrastructure is
meant for this kind of white box problem, but the space of tests that
can be written is so large, that there's always too few tests. It
shines when you have a clear bisection but an unclear fix (in the past I
have had cases where spending two days to write a test led me to writing
a fix in thirty minutes), but boosting the reproducibility is always a
good thing.
And please understand that I am not casting shade on those who wrote,
reviewed, and committed that buggy commit. As in I freely confess that
I had to stare at Sean's fix for a few minutes before I figured out what
was going on.
Oh don't worry about that---rather, I am going to cast a shade on those
that did not review the commit, namely me. I am somewhat obsessed with
Boolean logic and *probably* I would have caught it, or would have asked
to split the use of designated initializers to a separate patch. Any of
the two could, at least potentially, have saved you quite some time.
Instead, the point I am trying to make is that carefully
constructed tests can serve as tireless and accurate code reviewers.
This won't ever replace actual code review, but my experience indicates
that it will help find more bugs more quickly and more easily.
TBH this (conflict between virtual addresses on the host and the guest
leading to corruption of the guest) is probably not the kind of
adversarial test that one would have written or suggested right off the
bat. But it should be written now indeed.
Paolo
[1]
https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/