Re: Regressions in stable releases

Willy Tarreau <w@xxxxxx> · Fri, 6 Aug 2021 16:42:44 +0200

On Fri, Aug 06, 2021 at 07:16:32AM -0700, Guenter Roeck wrote:
> On 8/5/21 11:30 AM, Willy Tarreau wrote:
> > On Thu, Aug 05, 2021 at 10:29:49AM -0700, Guenter Roeck wrote:
> > > > It looks like a typical "works for me" regression. The best thing that
> > > > could possibly be done to limit such occurrences would be to wait "long
> > > > enough" before backporting them, in hope to catch breakage reports before
> > > > the backport, but here there were already 3 weeks between the patch was
> > > > submitted and it was backported.
> > > 
> > > No. The patch is wrong. It just _looks_ correct at first glance.
> > 
> > So that's the core of the problem. Stable maintainers cannot be tasked
> > to try to analyse each patch in its finest details to figure whether a
> > maintainer that's expected to be more knowledgeable than them on their
> > driver did something wrong.
> > 
> > Then in my opinion we should encourage *not* to use "Fixes:" on untested
> > patches (untested patches will always happen due to hardware availability
> > or lack of a reliable reproducer).
> > 
> > What about this to try to improve the situation in this specific case ?
> > 
> > Willy
> > 
> > 
> > > From ef646bae2139ba005de78940064c464126c430e6 Mon Sep 17 00:00:00 2001
> > From: Willy Tarreau <w@xxxxxx>
> > Date: Thu, 5 Aug 2021 20:24:30 +0200
> > Subject: docs: process: submitting-patches.rst: recommend against 'Fixes:' if
> >   untested
> > 
> > 'Fixes:' usually is taken as authority and often results in a backport. If
> > a patch couldn't be tested although it looks perfectly valid, better not
> > add this tag to leave a final chance for backporters to ask about the
> > relevance of the backport or to check for future tests.
> > 
> > Cc: Guenter Roeck <linux@xxxxxxxxxxxx>
> > Signed-off-by: Willy Tarreau <w@xxxxxx>
> > ---
> >   Documentation/process/submitting-patches.rst | 9 +++++++++
> >   1 file changed, 9 insertions(+)
> > 
> > diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
> > index 0852bcf73630..54782b0e2f4c 100644
> > --- a/Documentation/process/submitting-patches.rst
> > +++ b/Documentation/process/submitting-patches.rst
> > @@ -140,6 +140,15 @@ An example call::
> >   	$ git log -1 --pretty=fixes 54a4f0239f2e
> >   	Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
> > +Please note that a 'Fixes:' tag will most often result in your patch being
> > +automatically backported to stable branches. If for any reason you could not
> > +test that it really fixes the problem (for example, because the bug is not
> > +reproducible, or because you did not have access to the required hardware
> > +at the time of writing the patch to verify it does not cause regressions),
> > +even if you are absolutely certain of your patch's validity, do not include
> > +a 'Fixes:' tag and instead explain the situation in the commit message in
> > +plain English.
> > +
> 
> I don't think that would be a good idea, First, it would discourage people
> from using Fixes: tags, and second it would not really solve the problem either.
> 
> While I am sure that the patch in question wasn't really tested (after all,
> it broke EC communication on all Mediatek Chromebooks using SPI to communicate
> with the EC), we don't know what the author tested. It may well be (and is quite
> likely) that the author _did_ test the patch, only not the affected code path.
> That means the author may still have added a Fixes: tag in the wrong belief
> to have tested the fix, even with the above restrictions in place.

But this is exactly the principle of a regression, and the first rule that
evreyone learns in programming is that 100% coverage never exists. That's
particularly true in large projects where the skills are distributed over
many people. Everyone faithfully tests the faulty case (i.e. I tested that
it does solve the problem I could reproduce), without a knowledge of plenty
of other cases. I do cause regressions myself when fixing bugs, despite
running plenty of regtests, and I hate this. The only thing I can do in
this case is to communicate about it as soon as detected and produce another
fix as fast as possible.

> No, I think all we can do is to improve test coverage.

Improving test coverage is a way to *reduce* the frequency of regressions,
it *never* eliminates them. And I'd really like that people start to accept
that they are part of a normal development process, as careful as it can
be, and that it's precisely why deploying updates from least sensitive to
most sensitive environments is extremely important, and why bug reports
are critical to the overall stability. There's never a zero-risk software
nor their is any zero-risk upgrade. It's probably convenient to some users
to complain that "the kernel maintainers" broke their system during an
update between two stable releases, but this is just looking at the single
failure in the middle of thousands of fixes and the fact that they maybe
didn't even prepare the necessary rollback plan that must come with *any*
software upgrade. And given that the alternative is to keep all their bugs
unfixed, I find this a bit exagerated sometimes.

I noticed that among people who I've seen complaining about regressions in
-stable that extremely rare are those who faced more than one in a major
branch. Sure some are more painful than others, but that's still an extremely
high success rate.

Now maybe there are *classes* of bugs that could be better detected through
automated testing (new platforms, new workloads etc). But I strongly doubt
it will make whiners complain less given that their use cases are probably
mostly widely covered already and still slip through the rare cracks. And
they are likely not the ones reporting problems here in the first place.

Cheers,
Willy