[adding fstests and Zorro]

On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>
> I've been promoting the idea that running fstests once is nice,
> but things get interesting if you try to run fstests multiple
> times until a failure is found. It turns out at least kdevops has
> found tests which fail with a failure rate of typically 1/2 to
> 1/30 on average. A rate of 1/2 means a failure happens 50% of the
> time, whereas 1/30 means it takes about 30 runs to find the
> failure.
>
> I have tried my best to annotate failure rates on the test expunge
> lists when I know what they might be, for example:
>
> workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
>
> The annotation "failure rate about 1/15" is quite long, so I'd like
> to propose a standard, shorter way to represent this. How about:
>
> generic/530 # F:1/15

I am not fond of the 1/15 annotation at all, because the only fact
that you are able to document is that the test failed after 15 runs.
Suggesting that this means a failure rate of 1/15 is a very big leap.

> Then we could extend the definition: F is the current estimate,
> which can be just how long it took to find the first failure. A
> more valuable figure would be the average failure rate: run the
> test multiple times, say 10, see what the failure rate is each
> time, and then average the failures out. So this could be a more
> accurate representation. For this, how about:
>
> generic/530 # FA:1/15
>
> This would mean that on average the failure rate has been found to
> be about 1/15, determined based on 10 runs.
>
> We should also extend check for fstests/blktests to run a test
> until a failure is found and report back the number of successes.
>
> Thoughts?

I have had a discussion about those tests with Zorro. The tests that
some people refer to as "flaky" are valuable, but they are not
deterministic; they are stochastic. I think MTBF is the standard way
to describe the reliability of such tests, but I am having a hard
time imagining how the community could maintain accurate annotations
of this sort, so I would stick with documenting the facts (i.e. the
test failed after N runs).
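FWIW, documenting that fact is easy to automate even before check
grows a native mode for it. A minimal sketch of such a loop, assuming
a configured fstests checkout where ./check <test> exits non-zero on
failure (the wrapper itself and its variable names are illustrative,
not an existing check feature):

    #!/bin/bash
    # Repeat one fstests test until it fails, then report how many
    # runs passed first. Re-running this wrapper several times gives
    # the raw data one would need for any average ("FA") estimate.
    TEST=${1:-generic/530}   # test to hammer on (example default)
    MAX_RUNS=${2:-100}       # give up after this many clean runs

    passes=0
    while [ "$passes" -lt "$MAX_RUNS" ]; do
        if ! ./check "$TEST"; then
            echo "$TEST: failed after $passes successful run(s)"
            exit 1
        fi
        passes=$((passes + 1))
    done
    echo "$TEST: no failure in $passes runs"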
OTOH, we do have deterministic tests; maybe even the majority of
fstests are deterministic(?). Considering that every auto test loop
takes ~2 hours on our rig, and that I have been running over 100
loops over the past two weeks, if half of fstests are deterministic,
that is a lot of wait time and a lot of carbon emissions gone to
waste. It would have been nice if I had been able to exclude a
"deterministic" group.
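If such a group existed, skipping those tests on repeat loops would
need no new machinery, since check already supports including and
excluding groups with -g and -x; only the "deterministic" group name
below is hypothetical:

    # First loop: run the full auto group, deterministic tests
    # included.
    ./check -g auto

    # Repeat loops: run only the tests NOT tagged deterministic
    # (hypothetical group; -g/-x themselves are existing options).
    ./check -g auto -x deterministic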
The problem is: can a developer ever tag a test as being
"deterministic"?

Thanks,
Amir.