Re: [BUG] Guest OSes die simultaneously (bisected)

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Thu, 4 Jan 2024 17:32:34 +0100

On 1/4/24 17:06, Paul E. McKenney wrote:
Although I am happy to have been able to locate the commit (and even
happier that Sean spotted the problem and that you quickly pushed the
fix to mainline!), chasing this consumed a lot of time and systems over
an embarrassingly large number of months.  As in I first spotted this
bug in late July.  Despite a number of increasingly complex attempts,
bisection became feasible only after the buggy commit was backported to
our internal v5.19 code base.  🙁

Yes, this strikes two sore points.

One is that I have also experienced being able to bisect only with a 
somewhat more linear history (namely the CentOS Stream 9 aka c9s 
frankenkernel [1]) and not with upstream.  Even if the c9s kernel is not 
a fully linear set of commits, there's some benefit from merge commits 
that consist of slightly more curated set of patches, where each merge 
commit includes both new features and bugfixes.  Unfortunately, whether 
you'll be able to do this with the c9s kernel depends a lot on the 
subsystems involved and on the bug.  Both are factors that may or may 
not be known in advance.

The other, of course, is testing.  The KVM selftests infrastructure is 
meant for this kind of white box problem, but the space of tests that 
can be written is so large, that there's always too few tests.  It 
shines when you have a clear bisection but an unclear fix (in the past I 
have had cases where spending two days to write a test led me to writing 
a fix in thirty minutes), but boosting the reproducibility is always a 
good thing.

And please understand that I am not casting shade on those who wrote,
reviewed, and committed that buggy commit.  As in I freely confess that
I had to stare at Sean's fix for a few minutes before I figured out what
was going on.

Oh don't worry about that---rather, I am going to cast a shade on those 
that did not review the commit, namely me.  I am somewhat obsessed with 
Boolean logic and *probably* I would have caught it, or would have asked 
to split the use of designated initializers to a separate patch.  Any of 
the two could, at least potentially, have saved you quite some time.

Instead, the point I am trying to make is that carefully
constructed tests can serve as tireless and accurate code reviewers.
This won't ever replace actual code review, but my experience indicates
that it will help find more bugs more quickly and more easily.

TBH this (conflict between virtual addresses on the host and the guest 
leading to corruption of the guest) is probably not the kind of 
adversarial test that one would have written or suggested right off the 
bat.  But it should be written now indeed.

Paolo

[1] 
https://www.theregister.com/2023/06/30/enterprise_distro_feature_devconf/