Re: [PATCH 5.10 0/4] xfs stable candidate patches for 5.10.y (part 1)

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 26 May 2022 21:59:19 +0300

On Thu, May 26, 2022 at 9:44 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>
> On Thu, May 26, 2022 at 10:27:41AM -0700, Darrick J. Wong wrote:
> > /me looks and sees a large collection of expunge lists, along with
> > comments about how often failures occur and/or reasons.  Neat!
> >
> > Leah mentioned on the ext4 call this morning that she would have found
> > it helpful to know (before she started working on 5.15 backports) which
> > tests were of the flaky variety so that she could better prioritize the
> > time she had to look into fstests failures.  (IOWS: saw a test fail a
> > small percentage of the time and then burned a lot of machine time only
> > to figure out that 5.15.0 also failed a percentage of the time).
>
> See my proposal to try to make this easier to parse:
>
> https://lore.kernel.org/all/YoW0ZC+zM27Pi0Us@xxxxxxxxxxxxxxxxxxxxxx/
>
> > We talked about where it would be most useful for maintainers and QA
> > people to store their historical pass/fail data, before settling on
> > "somewhere public where everyone can review their colleagues' notes" and
> > "somewhere minimizing commit friction".  At the time, we were thinking
> > about having people contribute their notes directly to the fstests
> > source code, but I guess Luis has been doing that in the kdevops repo
> > for a few years now.
> >
> > So, maybe there?
>
> For now sure, I'm happy to add others to the linux-kdevops org on GitHub
> and they get immediate write access to the repo. This is working well
> so far. Long term we need to decide if we want to spin off the
> expunge list as a separate effort and make it a git subtree (note
> it is different than a git submodule). Another example of a use case
> for a git subtree, to use it as an example, is the way I forked
> kconfig from Linux into a standalone git tree so as to allow any project
> to bump kconfig code with just one command. So different projects
> don't need to fork kconfig as they do today.
>
> The value in doing the git subtree for expunges is any runner can use
> it. I did design kdevops though to run on *any* cloud, and support
> local virtualization tech like libvirt and virtualbox.
>
> The linux-kdevops git org also has other projects which both fstests
> and blktests depend on, so for example dbench which I had forked and
> cleaned up a while ago. It may make sense to share keeping oddball
> efforts like these which are no longer maintained in this repo.
>
> There is other tech I'm evaluating for this sort of collaborative test
> efforts such as ledgers, but that is in its infancy at this point in
> time. I have a sense though it may be a good outlet for collection of
> test artifacts in a decentralized way and also allow *any* entity to
> participate in bringing confidence to stable kernel branches or dev
> branches prior to release.
>

There are a few problems I noticed with the current workflow.

1. It will not scale to maintain this in git as more and more testers
start using kdevops and adding more and more fs and configs and distros.
How many more developers do you want to give push access to linux-kdevops?
I don't know how test labs report to KernelCI, but we need to look at their
model and not reinvent the wheel.

2. kdevops is very focused on stabilizing the baseline fast, which is
very good, but there is no good process for getting a test out of the expunge list.
We have a very strong suspicion that some of the tests that we put in
expunge lists failed due to some setup issue in the host OS that caused
NVMe IO errors in the guests. I tried to put that into comments when
I noticed that, but I am afraid there may have been other tests that are
falsely accused of failing. All developers make those mistakes in their
own expunge lists, but if we start propagating those mistakes to the world,
it becomes an issue.

For those two reasons I think that the model to aspire to should be
composed of a database to which absolutely everyone can post data
points in the form of facts (e.g. the test failed after N runs on this kernel
and this hardware...) and another process, partly AI, partly human, to
digest all those facts into a knowledge base that is valuable and
maintained by experts. Much easier said than done...
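
To make the "facts" idea concrete, here is a minimal sketch of what one
data point and a first digestion step could look like. Everything in it
is an assumption made for illustration: the RunFact record, its field
names and the two helper functions are hypothetical and do not
correspond to any existing kdevops, fstests or KernelCI schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunFact:
    """One raw observation posted by a tester (hypothetical schema)."""
    test: str      # test name, e.g. "generic/NNN"
    kernel: str    # kernel version the test ran on
    section: str   # fs configuration used for the run
    hardware: str  # free-form description of the host/guest setup
    runs: int      # how many times the test was run
    failures: int  # how many of those runs failed


def failure_rates(facts):
    """Sum runs and failures per (test, kernel, section); return failure rates."""
    totals = defaultdict(lambda: [0, 0])
    for fact in facts:
        key = (fact.test, fact.kernel, fact.section)
        totals[key][0] += fact.runs
        totals[key][1] += fact.failures
    return {key: fails / runs
            for key, (runs, fails) in totals.items() if runs > 0}


def flaky(facts):
    """Return keys whose tests sometimes fail and sometimes pass."""
    rates = failure_rates(facts)
    return sorted(key for key, rate in rates.items() if 0.0 < rate < 1.0)
```

Keeping the hardware description in each record, but out of the
aggregation key, lets a reviewer later check whether the failures of a
given test cluster on one setup, which is the kind of evidence needed to
decide that an expunge entry was a false accusation.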

Thanks,
Amir.