Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 19 May 2022 12:20:28 +0300

On Thu, May 19, 2022 at 10:58 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> > [adding fstests and Zorro]
> >
> > On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> > >
> > > I've been promoting the idea that running fstests once is nice,
> > > but things get interesting if you try to run fstests multiple
> > > times until a failure is found. It turns out at least kdevops has
> > > found tests which typically fail at an average rate of 1/2 to
> > > 1/30. That is, 1/2 means a failure can happen 50% of the time,
> > > whereas 1/30 means it takes 30 runs on average to find the
> > > failure.
> > >
> > > I have tried my best to annotate failure rates when I know what
> > > they might be on the test expunge list, as an example:
> > >
> > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> > >
> > > The term "failure rate 1/15" is 16 characters long, so I'd like
> > > to propose to standardize a way to represent this. How about
> > >
> > > generic/530 # F:1/15
> > >
> >
> > I am not fond of the 1/15 annotation at all, because the only fact that you
> > are able to document is that the test failed after 15 runs.
> > Suggesting that this means failure rate of 1/15 is a very big step.
> >
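
To put rough numbers on that step, here is a minimal sketch. It assumes
every run is independent and fails with the same probability p (a plain
Bernoulli model, which real test rigs may not follow). The function names
are made up for illustration and are not part of fstests or kdevops.

```python
def p_fail_within(p, n):
    """Probability of seeing at least one failure in n independent runs,
    each failing with probability p."""
    return 1.0 - (1.0 - p) ** n


def rate_bounds_from_first_failure(n, confidence=0.95):
    """Two-sided bounds on the per-run failure probability when the only
    observation is "first failure seen on run n" (geometric model).

    Assumption: runs are independent with a constant failure probability.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    alpha = 1.0 - confidence
    # Smallest p for which failing this early is still plausible:
    # P(first failure <= n) = 1 - (1 - p)^n = alpha / 2
    lo = 1.0 - (1.0 - alpha / 2.0) ** (1.0 / n)
    # Largest p for which surviving n - 1 runs is still plausible:
    # P(first failure >= n) = (1 - p)^(n - 1) = alpha / 2
    hi = 1.0 if n == 1 else 1.0 - (alpha / 2.0) ** (1.0 / (n - 1))
    return lo, hi


if __name__ == "__main__":
    print(p_fail_within(1 / 15, 15))
    print(rate_bounds_from_first_failure(15))
```

Under these assumptions, a single "failed on run 15" observation is
compatible with per-run failure probabilities from roughly 0.002 to 0.23,
and a test with a true rate of 1/15 still passes 15 runs in a row about
a third of the time.
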
> > > Then we could extend the definition. F being current estimate, and this
> > > can be just how long it took to find the first failure. A more valuable
> > > figure would be the average failure rate, so running the test multiple
> > > times, say 10, to see what the failure rate is and then averaging the
> > > failure out. So this could be a more accurate representation. For this
> > > how about:
> > >
> > > generic/530 # FA:1/15
> > >
> > > This would mean on average the failure rate has been found to be about
> > > 1/15, and this was determined based on 10 runs.
>
> These tests are run on multiple different filesystems. What happens
> if you run xfs, ext4, btrfs, overlay in sequence? We now have 4
> tests results, and 1 failure.
>
> Does that make it FA: 1/4, or does it make it 1/1,0/1,0/1,0/1?
>
> What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas?
>
> Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1?
>
> In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs
> 1/1 breakdown is useful information, because it tells us which
> filesystem failed the test, or which specific config failed the
> test.
>
> Hence I think the ability for us to draw useful conclusions from a
> number like this is largely dependent on the specific data set it is
> drawn from...
>
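
The difference between the aggregate and the per-config breakdown can be
sketched as follows. The (config, passed) input format and the
"name:failures/runs" output format are assumptions made for this example
only; neither fstests nor kdevops defines them.

```python
from collections import defaultdict


def tally(results):
    """Count failures and runs per config.

    results: iterable of (config_name, passed) tuples (hypothetical format).
    Returns {config_name: (failures, runs)} in first-seen order.
    """
    counts = defaultdict(lambda: [0, 0])  # config -> [failures, runs]
    for config, passed in results:
        counts[config][1] += 1
        if not passed:
            counts[config][0] += 1
    return {config: tuple(v) for config, v in counts.items()}


def format_breakdown(counts):
    """Render per-config counts, e.g. 'xfs:1/1,ext4:0/1'."""
    return ",".join(f"{c}:{f}/{r}" for c, (f, r) in counts.items())


def format_aggregate(counts):
    """Render the lumped-together figure, which loses the config."""
    failures = sum(f for f, _ in counts.values())
    runs = sum(r for _, r in counts.values())
    return f"{failures}/{runs}"


if __name__ == "__main__":
    results = [("xfs", False), ("ext4", True),
               ("btrfs", True), ("overlay", True)]
    counts = tally(results)
    print(format_aggregate(counts))   # 1/4
    print(format_breakdown(counts))   # xfs:1/1,ext4:0/1,btrfs:0/1,overlay:0/1
```

Both lines come from the same four results, but only the second one says
which filesystem failed.
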
> > > We should also go extend check for fstests/blktests to run a test
> > > until a failure is found and report back the number of successes.
> > >
> > > Thoughts?
>
> Who is the expected consumer of this information?
>
> I'm not sure it will be meaningful for anyone developing new code
> and needing to run every test every time they run fstests.
>
> OTOH, for a QA environment where you have a fixed progression of the
> kernel releases you are testing, it's likely valuable and already
> being tracked in various distro QE management tools and
> dashboards....
>
> > I have had a discussion about those tests with Zorro.
> >
> > Those tests that some people refer to as "flaky" are valuable,
> > but they are not deterministic, they are stochastic.
>
> Extremely valuable. Worth their weight in gold to developers like
> me.
>
> The recoveryloop group tests are a good example of this. The name of
> the group indicates how we use it. I typically set it up to run with
> a loop iteration count like "-I 100" knowing that it will likely fail a
> random test in the group within 10 iterations.
>
> Those one-off failures are almost always a real bug, and they are
> often unique and difficult to reproduce exactly. Post-mortem needs
> to be performed immediately because it may well be a unique one-off
> failure and running another test after the failure destroys the
> state needed to perform a post-mortem.
>
> Hence having a test farm running these multiple times and then
> reporting "failed once in 15 runs" isn't really useful to me as a
> developer - it doesn't tell us anything new, nor does it help us
> find the bugs that are being tripped over.
>
> Less obvious stochastic tests exist, too. There are many tests that
> use fsstress as a workload that runs while some other operation is
> performed - freeze, grow, ENOSPC, error injections, etc. They will
> never be deterministic, and again any failure tends to be a real
> bug, too.
>
> However, I think these should be run by QE environments all the time
> as they require long term, frequent execution across different
> configs in different environments to find the deep dark corners
> where the bugs may lie dormant. These are the tests that find things
> like subtle timing races no other tests ever exercise.
>
> I suspect that tests that alter their behaviour via LOAD_FACTOR or
> TIME_FACTOR will fall into this category.
>
> > I think MTBF is the standard way to describe reliability
> > of such tests, but I am having a hard time imagining how
> > the community can manage to document accurate annotations
> > of this sort, so I would stick with documenting the facts
> > (i.e. the test fails after N runs).
>
> I'm unsure of what "reliability of such tests" means in this context.
> The tests are trying to exercise and measure the reliability of the
> kernel code - if the *test is unreliable* then that says to me the
> test needs fixing. If the test is reliable, then any failures that
> occur indicate that the filesystem/kernel/fs tools are unreliable,
> not the test....
>
> "test reliability" and "reliability of filesystem under test" are
> different things with similar names. The latter is what I think we
> are talking about measuring and reporting here, right?
>
> > OTOH, we do have deterministic tests, maybe even the majority of
> > fstests are deterministic(?)
>
> Very likely. As a generalisation, I'd say that anything that has a
> fixed, single step at a time recipe and a very well defined golden
> output or exact output comparison match is likely deterministic.
>
> We use things like 'within_tolerance' so that slight variations in
> test results don't cause spurious failures and hence make the test
> more deterministic.  Hence any test that uses 'within_tolerance' is
> probably a test that is expecting deterministic behaviour....
>
> > Considering that every auto test loop takes ~2 hours on our rig and that
> > I have been running over 100 loops over the past two weeks, if half
> > of fstests are deterministic, that is a lot of wait time and a lot of carbon
> > emission gone to waste.
> >
> > It would have been nice if I was able to exclude a "deterministic" group.
> > The problem is - can a developer ever tag a test as being "deterministic"?
>
> fstests allows private exclude lists to be used - perhaps these
> could be used to start building such a group for your test
> environment. Building a list from the tests you never see fail in
> your environment could be a good way to seed such a group...
>
> Maybe you have all the raw results from those hundreds of tests
> sitting around - what does crunching that data look like? Who else
> has large sets of consistent historic data sitting around? I don't
> because I pollute my results archive by frequently running varied
> and badly broken kernels through fstests, but people who just run
> released or stable kernels may have data sets that could be used....
>
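
Seeding such a list from historic results could look roughly like the
sketch below. The (test_name, passed) record format and the min_runs
threshold are assumptions for illustration; real result archives will
need their own parsing. Zero failures in N runs does not prove a test is
deterministic, so the output is only a list of candidates.

```python
def never_failed(history, min_runs=100):
    """Return sorted test names that ran at least min_runs times and
    never failed.

    history: iterable of (test_name, passed) tuples (hypothetical format).
    """
    runs = {}
    failures = {}
    for test, passed in history:
        runs[test] = runs.get(test, 0) + 1
        if not passed:
            failures[test] = failures.get(test, 0) + 1
    return sorted(
        test for test, count in runs.items()
        if count >= min_runs and failures.get(test, 0) == 0
    )


if __name__ == "__main__":
    history = (
        [("generic/001", True)] * 3
        + [("generic/530", True), ("generic/530", False), ("generic/530", True)]
        + [("generic/002", True)] * 2
    )
    # One test name per line, the same shape as a plain exclude list.
    for name in never_failed(history, min_runs=3):
        print(name)
```

With the sample data only generic/001 qualifies: generic/530 failed once
and generic/002 has too few runs to say anything.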

I have no historic data of that sort and I have never stayed on the
same test system long enough to collect this sort of data.

Josef told us at LPC 2021 about his btrfs fstests dashboard,
where he started collecting historical data a while ago.

Collaborating on expunge lists for different filesystems and different
kernel/config/distro combinations is one of the goals behind Luis's
kdevops project.

For now, the expunge lists are curated in git:
https://github.com/linux-kdevops/kdevops/tree/master/workflows/fstests/expunges
Going forward, this cannot scale. If we want to collaborate and collect
results from multiple testers and test labs, we should consult with the
KernelCI project, who are doing exactly that for other test suites.

You did not attend Luis' talk at LSFMM this year, where some of these
issues were discussed (he had already mentioned kdevops back at LSFMM 2019).
The video of the LSFMM 2022 talk should be available in the coming weeks.
I hear that Luis is also planning to give a talk to a wider audience
at LPC 2022.

Thanks,
Amir.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx