On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> [adding fstests and Zorro]
>
> On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> >
> > I've been promoting the idea that running fstests once is nice,
> > but things get interesting if you try to run fstests multiple
> > times until a failure is found. It turns out at least kdevops has
> > found tests which fail with a failure rate of typically 1/2 to
> > 1/30 average failure rate. That is, 1/2 means a failure can happen
> > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > failure.
> >
> > I have tried my best to annotate failure rates when I know what
> > they might be on the test expunge list, as an example:
> >
> > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> >
> > The term "failure rate 1/15" is 16 characters long, so I'd like
> > to propose to standardize a way to represent this. How about
> >
> > generic/530 # F:1/15
> >
>
> I am not fond of the 1/15 annotation at all, because the only fact that you
> are able to document is that the test failed after 15 runs.
> Suggesting that this means a failure rate of 1/15 is a very big step.
>
> > Then we could extend the definition. F being the current estimate, and this
> > can be just how long it took to find the first failure. A more valuable
> > figure would be the average failure rate, so running the test multiple
> > times, say 10, to see what the failure rate is and then averaging the
> > failures out. So this could be a more accurate representation. For this
> > how about:
> >
> > generic/530 # FA:1/15
> >
> > This would mean that on average the failure rate has been found to be about
> > 1/15, and this was determined based on 10 runs.
> >
> > We should also extend check for fstests/blktests to run a test
> > until a failure is found and report back the number of successes.
> >
> > Thoughts?
> >
>
> I have had a discussion about those tests with Zorro.

Hi Amir,

Thanks for publicizing this discussion. Yes, we talked about this, but if I
remember correctly, I recommended that each downstream tester maintain their
own "testing data/config", such as exclude lists, failure ratios, known
failures, etc. I don't think these are suitable to be fixed in the mainline
fstests.

About the other idea I mentioned at LSF: we could create some more group
names to mark those cases that use random load/data/env etc., since they are
worth running more times. I also talked about that with Darrick; we haven't
made a decision yet, but I'd like to push for it if most other folks would
like to see it.

In my internal regression testing for RHEL, I give some fstests cases a new
group name, "redhat_random" (sure, I know it's not a good name, it's just for
my internal testing, better names are welcome, I'm not a good English
speaker :). Then, combined with the quick and stress group names, I loop-run
the "redhat_random" cases a different number of times, with different
LOAD/TIME_FACTOR values.
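
As a rough illustration only (a minimal sketch: the loop count and FACTOR
values below are arbitrary, and "redhat_random" is just my internal group
name, not an upstream group), the loop looks something like this:

    #!/bin/bash
    # Sketch: run the "redhat_random" group repeatedly with varying
    # load/time factors, stopping at the first failing loop.
    for i in $(seq 1 10); do
        export LOAD_FACTOR=$((RANDOM % 4 + 1))
        export TIME_FACTOR=$((RANDOM % 4 + 1))
        echo "loop $i: LOAD_FACTOR=$LOAD_FACTOR TIME_FACTOR=$TIME_FACTOR"
        ./check -g redhat_random || { echo "failed on loop $i"; break; }
    done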
So I hope to have one (or more specific) group name to mark those random test
cases at first, like [1] (I'm sure it's incomplete, but it can be improved if
we can get more help from more people :)

Thanks,
Zorro

[1]
generic/013 generic/019 generic/051 generic/068 generic/070 generic/075
generic/076 generic/083 generic/091 generic/112 generic/117 generic/127
generic/231 generic/232 generic/233 generic/263 generic/269 generic/270
generic/388 generic/390 generic/413 generic/455 generic/457 generic/461
generic/464 generic/475 generic/476 generic/482 generic/521 generic/522
generic/547 generic/551 generic/560 generic/561 generic/616 generic/617
generic/648 generic/650
xfs/011 xfs/013 xfs/017 xfs/032 xfs/051 xfs/057 xfs/068 xfs/079
xfs/104 xfs/137 xfs/141 xfs/167 xfs/297 xfs/305 xfs/442 xfs/517

>
> Those tests that some people refer to as "flaky" are valuable,
> but they are not deterministic, they are stochastic.
>
> I think MTBF is the standard way to describe the reliability
> of such tests, but I am having a hard time imagining how
> the community can manage to document accurate annotations
> of this sort, so I would stick with documenting the facts
> (i.e. the test fails after N runs).
>
> OTOH, we do have deterministic tests, maybe even the majority of
> fstests are deterministic(?)
>
> Considering that every auto test loop takes ~2 hours on our rig and that
> I have been running over 100 loops over the past two weeks, if half
> of fstests are deterministic, that is a lot of wait time and a lot of carbon
> emission gone to waste.
>
> It would have been nice if I was able to exclude a "deterministic" group.
> The problem is - can a developer ever tag a test as being "deterministic"?
>
> Thanks,
> Amir.
>
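
(A side note on the exclusion part, as an illustrative sketch only and
assuming we ever agree on such a group name: check can already exclude a
group with -x, so looping only the non-deterministic cases could be as
simple as something like

    ./check -g auto -x deterministic

where "deterministic" is a hypothetical group name.)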