On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> [adding fstests and Zorro]
>
> On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> >
> > I've been promoting the idea that running fstests once is nice,
> > but things get interesting if you try to run fstests multiple
> > times until a failure is found. It turns out at least kdevops has
> > found tests which fail with a failure rate of typically 1/2 to
> > 1/30 average failure rate. That is, 1/2 means a failure can happen
> > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > failure.
> >
> > I have tried my best to annotate failure rates when I know what
> > they might be on the test expunge list, as an example:
> >
> > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> >
> > The term "failure rate 1/15" is 16 characters long, so I'd like
> > to propose to standardize a way to represent this. How about
> >
> > generic/530 # F:1/15
> >
>
> I am not fond of the 1/15 annotation at all, because the only fact that you
> are able to document is that the test failed after 15 runs.
> Suggesting that this means a failure rate of 1/15 is a very big step.
>
> > Then we could extend the definition. F being the current estimate, and this
> > can be just how long it took to find the first failure. A more valuable
> > figure would be the average failure rate, so running the test multiple
> > times, say 10, to see what the failure rate is and then averaging the
> > failures out. So this could be a more accurate representation. For this
> > how about:
> >
> > generic/530 # FA:1/15
> >
> > This would mean that on average the failure rate has been found to be about
> > 1/15, and this was determined based on 10 runs.
> >
> > We should also extend check for fstests/blktests to run a test
> > until a failure is found and report back the number of successes.
> >
> > Thoughts?
> >
>
> I have had a discussion about those tests with Zorro.

Hi Amir,

Thanks for publicizing this discussion. Yes, we talked about this, but if I
remember correctly, I recommended that each downstream tester maintain their
own "testing data/config", such as exclude lists, failure ratios, known
failures, etc. I don't think these are suitable to be fixed in the mainline
fstests.

About the other idea I mentioned at LSF: we could create some more group
names to mark those cases that use random load/data/env etc., since they are
worth running more times. I also talked about that with Darrick; we haven't
made a decision yet, but I'd like to push for it if most other folks would
like to see it.

In my internal regression testing for RHEL, I give some fstests cases a new
group name, "redhat_random" (sure, I know it's not a good name, it's just for
my internal testing, better names are welcome, I'm not a good English
speaker :). Then, combined with the quick and stress group names, I loop-run
the "redhat_random" cases a different number of times, with different
LOAD/TIME_FACTOR values.
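
As a rough illustration only (a minimal sketch: the loop count and FACTOR
values below are arbitrary, and "redhat_random" is just my internal group
name, not an upstream group), the loop looks something like this:

    #!/bin/bash
    # Sketch: run the "redhat_random" group repeatedly with varying
    # load/time factors, stopping at the first failing loop.
    for i in $(seq 1 10); do
        export LOAD_FACTOR=$((RANDOM % 4 + 1))
        export TIME_FACTOR=$((RANDOM % 4 + 1))
        echo "loop $i: LOAD_FACTOR=$LOAD_FACTOR TIME_FACTOR=$TIME_FACTOR"
        ./check -g redhat_random || { echo "failed on loop $i"; break; }
    done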
So I hope to have one (or more specific) group name to mark those random test
cases at first, like [1] (I'm sure it's incomplete, but it can be improved if
we can get more help from more people :)

Thanks,
Zorro

[1]
generic/013 generic/019 generic/051 generic/068 generic/070 generic/075
generic/076 generic/083 generic/091 generic/112 generic/117 generic/127
generic/231 generic/232 generic/233 generic/263 generic/269 generic/270
generic/388 generic/390 generic/413 generic/455 generic/457 generic/461
generic/464 generic/475 generic/476 generic/482 generic/521 generic/522
generic/547 generic/551 generic/560 generic/561 generic/616 generic/617
generic/648 generic/650
xfs/011 xfs/013 xfs/017 xfs/032 xfs/051 xfs/057 xfs/068 xfs/079
xfs/104 xfs/137 xfs/141 xfs/167 xfs/297 xfs/305 xfs/442 xfs/517

>
> Those tests that some people refer to as "flaky" are valuable,
> but they are not deterministic, they are stochastic.
>
> I think MTBF is the standard way to describe the reliability
> of such tests, but I am having a hard time imagining how
> the community can manage to document accurate annotations
> of this sort, so I would stick with documenting the facts
> (i.e. the test fails after N runs).
>
> OTOH, we do have deterministic tests, maybe even the majority of
> fstests are deterministic(?)
>
> Considering that every auto test loop takes ~2 hours on our rig and that
> I have been running over 100 loops over the past two weeks, if half
> of fstests are deterministic, that is a lot of wait time and a lot of carbon
> emission gone to waste.
>
> It would have been nice if I was able to exclude a "deterministic" group.
> The problem is - can a developer ever tag a test as being "deterministic"?
>
> Thanks,
> Amir.
>
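
(A side note on the exclusion part, as an illustrative sketch only and
assuming we ever agree on such a group name: check can already exclude a
group with -x, so looping only the non-deterministic cases could be as
simple as something like

    ./check -g auto -x deterministic

where "deterministic" is a hypothetical group name.)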