[adding fstests and Zorro]

On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
>
> I've been promoting the idea that running fstests once is nice,
> but things get interesting if you try to run fstests multiple
> times until a failure is found. It turns out at least kdevops has
> found tests which fail with a failure rate of typically 1/2 to
> 1/30 on average. A rate of 1/2 means a failure happens 50% of the
> time, whereas 1/30 means it takes about 30 runs to find the
> failure.
>
> I have tried my best to annotate failure rates on the test expunge
> lists when I know what they might be, for example:
>
> workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
>
> The annotation "failure rate about 1/15" is quite long, so I'd like
> to propose a standard, shorter way to represent this. How about:
>
> generic/530 # F:1/15

I am not fond of the 1/15 annotation at all, because the only fact
that you are able to document is that the test failed after 15 runs.
Suggesting that this means a failure rate of 1/15 is a very big leap.

> Then we could extend the definition: F is the current estimate,
> which can be just how long it took to find the first failure. A
> more valuable figure would be the average failure rate: run the
> test multiple times, say 10, see what the failure rate is each
> time, and then average the failures out. So this could be a more
> accurate representation. For this, how about:
>
> generic/530 # FA:1/15
>
> This would mean that on average the failure rate has been found to
> be about 1/15, determined based on 10 runs.
>
> We should also extend check for fstests/blktests to run a test
> until a failure is found and report back the number of successes.
>
> Thoughts?

I have had a discussion about those tests with Zorro. The tests that
some people refer to as "flaky" are valuable, but they are not
deterministic; they are stochastic. I think MTBF is the standard way
to describe the reliability of such tests, but I am having a hard
time imagining how the community could maintain accurate annotations
of this sort, so I would stick with documenting the facts (i.e. the
test failed after N runs).
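FWIW, documenting that fact is easy to automate even before check
grows a native mode for it. A minimal sketch of such a loop, assuming
a configured fstests checkout where ./check <test> exits non-zero on
failure (the wrapper itself and its variable names are illustrative,
not an existing check feature):

    #!/bin/bash
    # Repeat one fstests test until it fails, then report how many
    # runs passed first. Re-running this wrapper several times gives
    # the raw data one would need for any average ("FA") estimate.
    TEST=${1:-generic/530}   # test to hammer on (example default)
    MAX_RUNS=${2:-100}       # give up after this many clean runs

    passes=0
    while [ "$passes" -lt "$MAX_RUNS" ]; do
        if ! ./check "$TEST"; then
            echo "$TEST: failed after $passes successful run(s)"
            exit 1
        fi
        passes=$((passes + 1))
    done
    echo "$TEST: no failure in $passes runs"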
OTOH, we do have deterministic tests; maybe even the majority of
fstests are deterministic(?). Considering that every auto test loop
takes ~2 hours on our rig, and that I have been running over 100
loops over the past two weeks, if half of fstests are deterministic,
that is a lot of wait time and a lot of carbon emissions gone to
waste. It would have been nice if I had been able to exclude a
"deterministic" group.
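If such a group existed, skipping those tests on repeat loops would
need no new machinery, since check already supports including and
excluding groups with -g and -x; only the "deterministic" group name
below is hypothetical:

    # First loop: run the full auto group, deterministic tests
    # included.
    ./check -g auto

    # Repeat loops: run only the tests NOT tagged deterministic
    # (hypothetical group; -g/-x themselves are existing options).
    ./check -g auto -x deterministic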
The problem is: can a developer ever tag a test as being
"deterministic"?

Thanks,
Amir.