On 04.01.22 15:17, Lukas Bulwahn wrote:
> On Mon, Jan 3, 2022 at 3:23 PM Thorsten Leemhuis <linux@xxxxxxxxxxxxx> wrote:
>>
>> Create a document explaining various aspects around regression handling
>> and tracking both for users and developers. Among others describe the
>> first rule of Linux kernel development and what it means in practice.
>> Also explain what a regression actually is and how to report them
>> properly. The text additionally provides a brief introduction to the bot
>> the kernel's regression tracker uses to facilitate the work. To sum
>> things up, provide a few quotes from Linus to show how seriously he
>> takes regressions.
>>
>> [...]
>
> [lots of helpful suggestions for fixes and small improvements]

Many thx, addressed all of them, not worth commenting on each of them
individually.

>> +What is the goal of the 'no regressions rule'?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Users should feel safe when updating kernel versions and not have to worry
>> +something might break. This is in the interest of the kernel developers to make
>> +updating attractive: they don't want users to stay on stable or longterm Linux
>> +series either abandoned or more than one and a half years old, as `those might
>> +have known problems, security issues, or other aspects already improved in later
>> +versions
>> +<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
>> +
> Maybe add something like this:
>
> A larger user community means more exposure and more confidence that
> any critical bug introduced is likely to be found closer to the point
> in time it was introduced, and hence the shipped kernels have fewer
> critical bugs.
>
> Just to close the line of thought here.

Hmmm. How about this instead:

The kernel developers also want to make it simple and appealing for users to
test the latest (pre-)release, as it's a lot easier to track down and fix
problems if they are reported shortly after being introduced.

> Okay, that is how far I got reading for now.

Great, many thx for your help, much appreciated. FWIW, find below the current
version of the plain text which contains a few more fixes. Note, Thunderbird
will insert wrong line breaks here.

Ciao, Thorsten

Does it qualify as a regression if a newer kernel works slower or makes the system consume more energy?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It does, but the difference has to be significant. A five percent slow-down in
a micro-benchmark thus is unlikely to qualify as a regression, unless it also
influences the results of a broad benchmark by more than one percent. If in
doubt, ask for advice.

Is it a regression if an externally developed kernel module is incompatible with a newer kernel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, as the 'no regression' rule is about interfaces and services the Linux
kernel provides to the userland. It thus does not cover building or running
externally developed kernel modules, as they run in kernel-space and hook into
the kernel using internal interfaces that change occasionally.

How are regressions handled that are caused by a fix for a security vulnerability?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In extremely rare situations security issues can't be fixed without causing
regressions; in such cases the fixes take precedence, as a regression is the
lesser evil in the end. Luckily this can almost always be avoided, as key
developers for the affected area and often Linus Torvalds himself try very hard
to fix security issues without causing regressions.

If you nevertheless face such a case, check the mailing list archives to see
whether people tried their best to avoid the regression; if in doubt, ask for
advice as outlined above.

What happens if fixing a regression is impossible without causing another regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sadly these things happen, but luckily not very often; if they occur, expert
developers of the affected code area should look into the issue to find a fix
that avoids regressions or at least minimizes their impact. If you run into
such a situation, do what was outlined already for regressions caused by
security fixes: check earlier discussions to see whether people already tried
their best, and ask for advice if in doubt.

A quick note while at it: these situations could be avoided if you regularly
gave mainline pre-releases (say v5.15-rc1 or -rc3) from each cycle a test run.
This is best explained by imagining a change integrated between Linux v5.14 and
v5.15-rc1 which causes a regression, but at the same time is a hard requirement
for some other improvement applied for 5.15-rc1. All these changes often can
simply be reverted and the regression thus solved, if someone finds and reports
it before 5.15 is released. A few days or weeks after the release this solution
might become impossible, if some software starts to rely on aspects introduced
by one of the follow-up changes: reverting all changes would then cause
regressions for users of said software and is thus out of the question.

A feature I relied on was removed months ago, but I only noticed now. Does that qualify as a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It does, but such regressions are often hard to fix due to the aspects outlined
in the previous section. Hence they need to be dealt with on a case-by-case
basis; this is another reason why it's in your interest to regularly test
mainline releases.

Does the 'no regression' rule apply if I seem to be the only person in the world that is affected by a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It does, but only for practical usage: the Linux developers want to be free to
remove support for hardware that these days can only be found in attics and
museums.

Note, sometimes regressions can't be avoided to make progress -- and the latter
is needed to prevent Linux from stagnation. Hence, if only very few users seem
to be affected by a regression, it might for the greater good be in their and
everyone else's interest not to insist on the rule. This is especially true if
there is an easy way to circumvent the regression somehow, for example by
updating some software or using a kernel parameter created just for this
purpose.

Does the regression rule apply for code in the staging tree as well?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Not according to the `help text for the configuration option covering all
staging code
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
which since its early days states::

       Please note that these drivers are under heavy development, may or
       may not work, and may contain userspace interfaces that most likely
       will be changed in the near future.

The staging developers nevertheless often adhere to the 'no regressions' rule,
but sometimes bend it to make progress. That's for example why some users had
to deal with (often negligible) regressions when a WiFi driver from the staging
tree was replaced by a totally different one written from scratch.

Why do later versions have to be 'compiled with a similar configuration'?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because the Linux kernel developers sometimes integrate changes known to cause
regressions, but make them optional and disable them in the kernel's default
configuration. This trick allows progress, as the 'no regressions' rule
otherwise would lead to stagnation.

Consider for example a new security feature which blocks access to some kernel
interfaces that are often abused by malware, but at the same time are required
to run a few rarely used applications. The outlined trick makes both camps
happy: people using these applications can leave the new security feature off,
while everyone else can enable it without running into trouble.

How to create a configuration similar to the one of an older kernel?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Start a known-good kernel and configure the newer Linux version with ``make
olddefconfig``. This makes the kernel's build scripts pick up the configuration
file (the `.config` file) from the running kernel as base for the new one you
are about to compile; afterwards they set all new configuration options to
their default value, which disables new features that might cause regressions.

Can I report a regression with vanilla kernels provided by someone else to the upstream Linux kernel developers?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Only if the newer kernel was compiled with a similar configuration file as the
older one (see above), as your provider might have enabled some
known-to-be-incompatible feature in the newer kernel. If in doubt, report the
problem to the provider and ask for advice.

More details about regressions relevant for developers
------------------------------------------------------

What should I do if I suspect a change I'm working on might cause regressions?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate how big the risk of regressions is, for example by performing a code
search in Linux distributions and Git forges. Also consider asking other
developers or projects likely to be affected to evaluate or even test the
proposed change; if problems surface, maybe some middle ground acceptable for
all can be found.

If the risk of regressions in the end seems to be relatively small, go ahead
with the change, but let all involved parties know about the risk. Hence, make
sure your patch description makes this aspect obvious. Once the change is
merged, tell the Linux kernel's regression tracker and the regressions mailing
list about the risk, so everyone has the change on the radar in case reports
trickle in. Depending on the risk, you also might want to ask the subsystem
maintainer to mention the issue in their pull request to mainline.

Everything developers need to know about regression tracking
------------------------------------------------------------

Do I have to use regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~

It's in the interest of everyone if you do, as kernel maintainers like Linus
Torvalds partly rely on regzbot's tracking in their work -- for example when
deciding to release a new version or extend the development phase. For this
they need to be aware of all unfixed regressions; to do that, Linus is known to
look into the weekly reports sent by regzbot.

Do I have to tell regzbot about every regression I stumble upon?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ideally yes: we are all humans and easily forget problems when something more
important unexpectedly comes up -- for example a bigger problem in the Linux
kernel or something in real life that's keeping us away from keyboards for a
while. Hence, it's best to tell regzbot about every regression, except when you
immediately write a fix and commit it to a tree regularly merged to the
affected kernel series.

Why does the Linux kernel need a regression tracker, and why does he utilize regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Rules like 'no regressions' need someone to enforce them, otherwise they are
broken either accidentally or on purpose. History has shown that this is true
for the Linux kernel as well.
That's why Thorsten volunteered to keep an eye on things.

Tracking regressions completely manually has proven to be exhausting and
demotivating, which is why earlier attempts to establish it failed after a
while. To prevent this from happening again, Thorsten developed regzbot to
facilitate the work, with the long term goal to automate regression tracking as
much as possible for everyone involved.

How does regression tracking work with regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The bot keeps track of all the reports and monitors their fixing progress. It
tries to do that with as little overhead as possible for both reporters and
developers.

In fact, only reporters or someone helping them are burdened with an extra
duty: they need to tell regzbot about the regression report using one of the
``#regzbot introduced`` commands outlined above.

For developers there normally is no extra work involved; they just need to do
something that's expected from them already: add 'Link:' tags to the patch
description pointing to all reports about the issue fixed.

Thanks to these tags regzbot can associate regression reports with patches to
fix the issue, whenever they are posted for review or applied to a git tree.
The bot additionally watches out for replies to the report. All this data
combined provides a good impression about the current status of the fixing
process.

How to see which regressions regzbot tracks currently?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Check `regzbot's web-interface
<https://linux-regtracking.leemhuis.info/regzbot/>`_ for the latest info;
alternatively, `search for the latest regression report
<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
which regzbot normally sends out once a week on Sunday evening (UTC), a few
hours before Linus usually publishes new (pre-)releases.

What places is regzbot monitoring?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Regzbot is watching the most important Linux mailing lists as well as the
linux-next, mainline and stable/longterm git repositories.

How to interact with regzbot?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Everyone can interact with the bot using mails containing `regzbot commands`,
which need to be in their own paragraph (IOW: they need to be separated from
the rest of the mail using blank lines). One such command is ``#regzbot
introduced <version or commit>``, which adds a report to the tracking, as
already described above; ``#regzbot ^introduced <version or commit>`` is
another such command, which makes regzbot consider the parent mail as a report
for a regression which it starts to track.

Once one of those two commands has been utilized, other regzbot commands can be
used. You can write them below one of the `introduced` commands or in replies
to the mail that used one of them or itself is a reply to that mail:

 * Set or update the title::

       #regzbot title: foo

 * Link to a related discussion (for example the posting of a patch to fix the
   issue) and monitor it::

       #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@xxxxxxxxxxxxxxxxxx/

   Monitoring only works for lore.kernel.org; regzbot will consider all
   messages in that thread as related to the fixing process.

 * Point to a place with further details, like a bug tracker or a related
   mailing list post::

       #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789

 * Mark a regression as fixed by a commit that is heading upstream or already
   landed::

       #regzbot fixed-by: 1f2e3d4c5d

 * Mark a regression as a duplicate of another one already tracked by regzbot::

       #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@xxxxxxxxxxxxxxxxxx/

 * Mark a regression as invalid::

       #regzbot invalid: wasn't a regression, problem has always existed

Is there more to tell about regzbot and its commands?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

More detailed and up-to-date information about the Linux kernel's regression
tracking bot can be found on its `project page
<https://gitlab.com/knurd42/regzbot>`_, which among others contains a `getting
started guide
<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
`reference documentation
<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_, both of
which are more in-depth.
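
To illustrate the "own paragraph" rule from the section on interacting with
regzbot, here is a minimal sketch in Python. It is not regzbot's actual parser:
the function name, the regular expression, and the example mail are all
made up for illustration, and the real bot might well handle corner cases
differently. The sketch only picks up commands from paragraphs that consist of
nothing but ``#regzbot`` lines and ignores mentions in running text.

```python
import re

# Hypothetical pattern, not taken from regzbot: "#regzbot", a command word
# (optionally prefixed with "^"), an optional colon, then the argument.
COMMAND_RE = re.compile(r"^#regzbot\s+(\^?[a-z-]+):?\s*(.*)$")


def extract_regzbot_commands(mail_body):
    """Return (command, argument) pairs from command-only paragraphs."""
    commands = []
    # Paragraphs are separated from the rest of the mail by blank lines.
    for paragraph in re.split(r"\n\s*\n", mail_body.strip()):
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        matches = [COMMAND_RE.match(line) for line in lines]
        # Skip paragraphs that are empty or mix commands with ordinary text.
        if not lines or not all(matches):
            continue
        commands.extend((m.group(1), m.group(2)) for m in matches)
    return commands


# Made-up example mail; the version number is only a placeholder.
EXAMPLE_MAIL = """\
Hi, this started to fail after updating the kernel.

#regzbot introduced v5.14
#regzbot title: foo

I mentioned #regzbot title: bar in running text, which is ignored.
"""

if __name__ == "__main__":
    for command, argument in extract_regzbot_commands(EXAMPLE_MAIL):
        print(command, "->", argument)
```

Run on the example mail, the sketch reports the ``introduced`` and ``title``
commands from the second paragraph and skips the mention in the last one, as
that paragraph also contains ordinary text.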