On Wed, Dec 02, 2020 at 10:04:17PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 03, 2020 at 07:40:45AM +1100, Dave Chinner wrote:
> > On Wed, Dec 02, 2020 at 08:06:01PM +0100, Greg Kroah-Hartman wrote:
> > > On Wed, Dec 02, 2020 at 06:41:43PM +0100, Miklos Szeredi wrote:
> > > > On Wed, Dec 2, 2020 at 5:24 PM David Howells <dhowells@xxxxxxxxxx> wrote:
> > > > >
> > > > > Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> > > > >
> > > > > > Stable cc also?
> > > > > >
> > > > > > Cc: <stable@xxxxxxxxxxxxxxx> # 5.8
> > > > >
> > > > > That seems to be unnecessary, provided there's a Fixes: tag.
> > > >
> > > > Is it?
> > > >
> > > > Fixes: means it fixes a patch, Cc: stable means it needs to be
> > > > included in stable kernels. The two are not necessarily the same.
> > > >
> > > > Greg?
> > >
> > > You are correct. cc: stable, as is documented in
> > > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > ensures that the patch will get merged into the stable tree.
> > >
> > > Fixes: is independent of it. It's great to have for stable patches so
> > > that I know how far back to backport patches.
> > >
> > > We do scan all commits for Fixes: tags that do not have cc: stable, and
> > > try to pick them up when we can and have the time to do so. But it's
> > > not guaranteed at all that this will happen.
> > >
> > > I don't know why people keep getting confused about this, we don't
> > > document the "Fixes: means it goes to stable" anywhere...
> >
> > Except that is exactly what happens, sometimes within a day or two
> > of a patch with a Fixes tag hitting Linus' kernel. We have had a
> > serious XFS regression in the 5.9.9 stable kernel that should never
> > have happened as a result of exactly this "Fixes = automatically
> > swept immediately into stable kernels" behaviour.
> > See here for post-mortem analysis:
> >
> > https://lore.kernel.org/linux-xfs/20201126071323.GF2842436@xxxxxxxxxxxxxxxxxxx/T/#m26e14ebd28ad306025f4ebf37e2aae9a304345a5
> >
> > This happened because these automated Fixes scans seem to occur
> > weekly during -rcX release periods, which means there really is *no
> > practical difference* between the way the stable process treats
> > Fixes tags and cc: stable.
>
> Sometimes, yes, that is true. But as it went into Linus's tree at the
> same time, we just ended up with "bug compatible" trees :)
>
> Not a big deal overall, happens every few releases, we fix it up and
> move on. The benefits in doing all of this _FAR_ outweigh the very
> infrequent times that kernel developers get something wrong.

I'm not debating that users benefit from backports. I'm talking
about managing risk profiles and how to prevent an entirely
preventable stable kernel regression from happening again.

Talking about risk profiles, the issue here is that the regression
that slipped through to the stable kernels had a -catastrophic- risk
profile. That's exactly the sort of thing that the stable kernel is
supposed to avoid exposing users to, and that raises the importance
and priority of ensuring that *never happens again*.

And the cause of this regression slipping through to stable kernel
users? It was a result of the automated "fixes" scan done by the
stable process that results in "fixes" meaning the same thing as
"cc: stable"....

> As always, if you do NOT want your subsystem to have fixes: tags picked
> up automatically by us for stable trees, just email us and let us know
> to not do that and we gladly will.

No, that is not an acceptable solution for anyone. The stable
maintainers need to stop suggesting this as a solution to any
criticism that is raised against the stable process. You may as well
just say "shut up, go away, we don't care what you want".

> > It seems like this can all be avoided simply by scheduling the
> > automated fixes scans once the upstream kernel is released, not
> > while it is still being stabilised by -rc releases. That way stable
> > kernels get better tested fixes, they still get the same quantity of
> > fixes, and upstream developers have some margin to detect and
> > correct regressions in fixes before they get propagated to users.
>
> So the "magic" -final release from Linus would cause this to happen?
> That means that the world would go for 3 months without some known fixes
> being applied to the tree? That's not acceptable to me, as I started
> doing this because it was needed to be done, not just because I wanted
> to do more work...

I'm not suggesting that all fixes across the entire kernel get held
until release. That's just taking things to extremes for no valid
reason as the risk profiles of most subsystems don't justify needing
a margin of error that large.

I'm asking that specific subsystems with catastrophic failure risk
profiles be allowed to opt out of the "just merged" fixes scans and
instead have them replaced by a less frequent scan.

Perhaps we don't even need to wait for the full release. Maybe just
increasing the fixes scanning window for those subsystems to pick up
changes in -rc(X-2) so that the commits have been exposed to testing
for a couple of weeks before being considered a stable backport
candidate.

That mitigates the immediate risk concern as it gives developers
time to catch and fix regressions before stable backports are done.
Such a 2 week delay would have avoided exposing stable kernel users
to a dangerous regression that should never have been released
outside developer and test machines exercising the upstream -rcX
tree.

> > It also creates a clear demarcation between fixes and cc: stable for
> > maintainers and developers: only patches with a cc: stable will be
> > backported immediately to stable.
> > Developers know what patches need
> > urgent backports and, unlike developers, the automated fixes scan
> > does not have the subject matter expertise or background to make
> > that judgement....
>
> Some subsystems do not have such clear demarcation at all. Heck, some
> subsystems don't even add a cc: stable to known major fixes. And that's
> ok, the goal of the stable kernel work is to NOT impose additional work
> on developers or maintainers if they don't want to do that work.

Engineering is as much about improving processes as it is about
improving the thing that is being built. I'm not asking you to stop
backporting fixes or stop improving the stable kernels. All I'm
asking for is to increase the latency of backports for some
subsystems because a margin of error is needed to minimise the risk
profile stable users are exposed to.

IOWs, I'm asking for a *minor tweak* to the existing process, not
asking you to start all over again.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx