Re: [PATCH v1] docs: handling-regressions: rework section about fixing procedures

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Wed, 31 May 2023 20:36:26 +0100



On Mon, May 15, 2023 at 10:40:48AM +0200, Thorsten Leemhuis wrote:
> This basically rewrites the 'Prioritize work on fixing regressions'
> section of Documentation/process/handling-regressions.rst for various
> reasons. Among them: some things were too demanding, some didn't align
> well with the usual workflows, and some apparently were not clear enough
> -- and of course a few things were missing that would be good to have in
> there.
> 
> Linus for example recently stated that regressions introduced during the
> past year should be handled similarly to regressions from the current
> cycle, if it's a clear fix with no semantic subtlety. His exact
> wording[1] didn't fit well into the text structure, but the author tried
> to stick close to the apparent intention.
> 
> It was a noble goal from the original author to state "[prevent
> situations that might force users to] continue running an outdated and
> thus potentially insecure kernel version for more than two weeks after a
> regression's culprit was identified"; this directly led to the goal "fix
> regression in mainline within one week, if the issue made it into a
> stable/longterm kernel", because the stable team needs time to pick up
> and prepare a new release. But apparently all that was a bit too
> demanding.
> 
> That "one week" target for example doesn't align well with the usual
> habits of the subsystem maintainers, who normally send their fixes to
> Linus once a week; and it doesn't align too well with stable/longterm
> releases either, which often enter a -rc phase on Mondays or Tuesdays
> and then are released two to three days later. And asking developers to
> create, review, and mainline fixes within one week might be too much to
> ask for in general. Hence tone the general goal down to three weeks and
> use an approach that better aligns with the usual merging and release
> habits.
> 
> While at it, also make the rules of thumb a bit easier to follow by
> grouping them by topic (e.g. generic things, timing, procedures, ...).
> 
> Also add text for a few cases where recent discussions showed they need
> covering. Among them are multiple points that better explain the
> relations to stable and longterm kernels and the team that manages them;
> they and the group separators are the primary reason why this whole
> section sadly grew somewhat in the rewrite.
> 
> The group about those relations led to one addition the author came up
> with without any precedent from Linus: the text now tells developers to
> add a stable tag for any regression that made it into a proper mainline
> release during the past 12 months. This is meant to ensure the stable
> team will definitely notice any fixes for recent regressions. That
> includes those introduced shortly before a new mainline release and
> found right after it; without such a rule the stable team might miss the
> fix, which then would only reach users after weeks or months with later
> releases.
> 
> Note, the aspect "Do not consider regressions from the current cycle as
> something that can wait till the cycle's end [...]" might look like an
> addition, but kinda was in the old text as well -- but only
> indirectly. That apparently was too subtle, as many developers seem to
> assume waiting till the end of the cycle is fine (even for build
> fixes).
> 
> In practice this was especially problematic when a cause of a regression
> made it into a proper release (either directly or through a backport). A
> revert performed by Linus shortly before the 6.3 release illustrated
> that[2], as the developer of the culprit had been willing to revert the
> culprit about three weeks earlier already -- but didn't do so when a fix
> came into sight and a maintainer suggested it could wait. Due to that the
> issue in the end plagued users of 6.2.y at least two weeks longer than
> necessary, as the fix in the end didn't become ready in time. This issue
> in fact could have been resolved one or two additional weeks earlier, if
> the developer had reverted the culprit shortly after it had been
> identified (which even the old version of the text suggests doing in such
> cases).
> 
> [1] https://lore.kernel.org/all/CAHk-=wis_qQy4oDNynNKi5b7Qhosmxtoj1jxo5wmB6SRUwQUBQ@xxxxxxxxxxxxxx/
> 
> [2] https://lore.kernel.org/all/CAHk-=wgD98pmSK3ZyHk_d9kZ2bhgN6DuNZMAJaV0WTtbkf=RDw@xxxxxxxxxxxxxx/
> 
> CC: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> CC: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> CC: Lukas Bulwahn <lukas.bulwahn@xxxxxxxxx>
> Signed-off-by: Thorsten Leemhuis <linux@xxxxxxxxxxxxx>

Acked-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>