On Thu, May 19, 2022 at 09:36:41AM +0300, Amir Goldstein wrote:
> [adding fstests and Zorro]
>
> On Thu, May 19, 2022 at 6:07 AM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote:
> >
> > I've been promoting the idea that running fstests once is nice,
> > but things get interesting if you try to run fstests multiple
> > times until a failure is found. It turns out at least kdevops has
> > found tests which fail with a failure rate of typically 1/2 to
> > 1/30 average failure rate. That is 1/2 means a failure can happen
> > 50% of the time, whereas 1/30 means it takes 30 runs to find the
> > failure.
> >
> > I have tried my best to annotate failure rates when I know what
> > they might be on the test expunge list, as an example:
> >
> > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d
> >
> > The term "failure rate 1/15" is 16 characters long, so I'd like
> > to propose to standardize a way to represent this. How about
> >
> > generic/530 # F:1/15
> >
>
> I am not fond of the 1/15 annotation at all, because the only fact that you
> are able to document is that the test failed after 15 runs.
> Suggesting that this means failure rate of 1/15 is a very big step.
>
> > Then we could extend the definition. F being current estimate, and this
> > can be just how long it took to find the first failure. A more valuable
> > figure would be failure rate average, so running the test multiple
> > times, say 10, to see what the failure rate is and then averaging the
> > failure out. So this could be a more accurate representation. For this
> > how about:
> >
> > generic/530 # FA:1/15
> >
> > This would mean on average the failure rate has been found to be about
> > 1/15, and this was determined based on 10 runs.

These tests are run on multiple different filesystems. What happens
if you run xfs, ext4, btrfs, overlay in sequence?
We now have 4 test results, and 1 failure. Does that make it FA: 1/4,
or does it make it 1/1,0/1,0/1,0/1?

What happens if we run, say, XFS w/ defaults, rmapbt=1, v4, quotas?

Does that make it FA: 1/4, or does it make it 0/1,1/1,0/1,0/1?

In each case above, 1/4 tells us nothing useful. OTOH, the 0/1 vs
1/1 breakdown is useful information, because it tells us which
filesystem failed the test, or which specific config failed the
test.

Hence I think the ability for us to draw useful conclusions from a
number like this is largely dependent on the specific data set it is
drawn from...

> > We should also go extend check for fstests/blktests to run a test
> > until a failure is found and report back the number of successes.
> >
> > Thoughts?

Who is the expected consumer of this information?

I'm not sure it will be meaningful for anyone developing new code
and needing to run every test every time they run fstests.

OTOH, for a QA environment where you have a fixed progression of the
kernel releases you are testing, it's likely valuable and already
being tracked in various distro QE management tools and
dashboards....

> I have had a discussion about those tests with Zorro.
>
> Those tests that some people refer to as "flaky" are valuable,
> but they are not deterministic, they are stochastic.

Extremely valuable. Worth their weight in gold to developers like
me.

The recoveryloop group tests are a good example of this. The name of
the group indicates how we use it. I typically set it up to run with
a loop iteration count like "-I 100", knowing that it will likely
fail a random test in the group within 10 iterations.

Those one-off failures are almost always a real bug, and they are
often unique and difficult to reproduce exactly. Post-mortem needs
to be performed immediately because it may well be a unique one-off
failure, and running another test after the failure destroys the
state needed to perform a post-mortem.
Hence having a test farm running these multiple times and then
reporting "failed once in 15 runs" isn't really useful to me as a
developer - it doesn't tell us anything new, nor does it help us
find the bugs that are being tripped over.

Less obvious stochastic tests exist, too. There are many tests that
use fsstress as a workload that runs while some other operation is
performed - freeze, grow, ENOSPC, error injections, etc. They will
never be deterministic, and again any failure tends to be a real
bug, too.

However, I think these should be run by QE environments all the time
as they require long term, frequent execution across different
configs in different environments to find the deep dark corners
where the bugs may lie dormant. These are the tests that find things
like subtle timing races no other tests ever exercise.

I suspect that tests that alter their behaviour via LOAD_FACTOR or
TIME_FACTOR will fall into this category.

> I think MTBF is the standard way to describe reliability
> of such tests, but I am having a hard time imagining how
> the community can manage to document accurate annotations
> of this sort, so I would stick with documenting the facts
> (i.e. the test fails after N runs).

I'm unsure of what "reliability of such tests" means in this context.
The tests are trying to exercise and measure the reliability of the
kernel code - if the *test is unreliable* then that says to me the
test needs fixing. If the test is reliable, then any failures that
occur indicate that the filesystem/kernel/fs tools are unreliable,
not the test....

"test reliability" and "reliability of filesystem under test" are
different things with similar names. The latter is what I think we
are talking about measuring and reporting here, right?

> OTOH, we do have deterministic tests, maybe even the majority of
> fstests are deterministic(?)

Very likely.
As a generalisation, I'd say that anything that has a fixed, single
step at a time recipe and a very well defined golden output or exact
output comparison match is likely deterministic.

We use things like 'within tolerance' so that slight variations in
test results don't cause spurious failures and hence make the test
more deterministic. Hence any test that uses 'within_tolerance' is
probably a test that is expecting deterministic behaviour....

> Considering that every auto test loop takes ~2 hours on our rig and that
> I have been running over 100 loops over the past two weeks, if half
> of fstests are deterministic, that is a lot of wait time and a lot of carbon
> emission gone to waste.
>
> It would have been nice if I was able to exclude a "deterministic" group.
> The problem is - can a developer ever tag a test as being "deterministic"?

fstests allows private exclude lists to be used - perhaps these
could be used to start building such a group for your test
environment. Building a list from the tests you never see fail in
your environment could be a good way to seed such a group...

Maybe you have all the raw results from those hundreds of tests
sitting around - what does crunching that data look like? Who else
has large sets of consistent historic data sitting around? I don't,
because I pollute my results archive by frequently running varied
and badly broken kernels through fstests, but people who just run
released or stable kernels may have data sets that could be used....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
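
The aggregate-versus-breakdown point and the idea of seeding a list
from tests that never fail could be sketched roughly as follows. This
is only a minimal Python sketch: the (test, config, passed) record
format, the sample data and the helper names tally, aggregate,
breakdown and never_failed are all assumptions made for illustration.
fstests does not emit results in this form, so real results would
need to be parsed into it first.

```python
from collections import defaultdict

# Hypothetical run records: (test name, config label, passed?).
# This format and the data are illustrative assumptions only.
RUNS = [
    ("generic/530", "xfs", False),
    ("generic/530", "ext4", True),
    ("generic/530", "btrfs", True),
    ("generic/530", "overlay", True),
    ("generic/001", "xfs", True),
    ("generic/001", "ext4", True),
]


def tally(runs):
    """Return {test: {config: [failures, total_runs]}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for test, config, passed in runs:
        entry = counts[test][config]
        entry[1] += 1          # one more run for this test/config
        if not passed:
            entry[0] += 1      # one more failure
    return counts


def aggregate(per_config):
    """Collapse all configs into a single failures/runs string."""
    fails = sum(f for f, _ in per_config.values())
    total = sum(n for _, n in per_config.values())
    return f"{fails}/{total}"


def breakdown(per_config):
    """Keep failures/runs separate for each config."""
    return ",".join(f"{cfg}:{f}/{n}" for cfg, (f, n) in per_config.items())


def never_failed(counts):
    """Tests with zero failures in every config seen so far."""
    return sorted(
        test
        for test, per_config in counts.items()
        if all(f == 0 for f, _ in per_config.values())
    )


if __name__ == "__main__":
    counts = tally(RUNS)
    for test in sorted(counts):
        print(test, "aggregate:", aggregate(counts[test]),
              "breakdown:", breakdown(counts[test]))
    print("never failed:", never_failed(counts))
```

With the sample data, generic/530 collapses to 1/4 in aggregate while
the breakdown shows the single failure was on xfs. The never_failed
list is only a candidate seed for a private exclude list: a test that
has not failed in one data set has not been shown to be deterministic.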