On Mon, Jul 04, 2022 at 10:58:22AM +0300, Amir Goldstein wrote: > On Mon, Jul 4, 2022 at 6:25 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > On Sun, Jul 03, 2022 at 08:56:54AM +0300, Amir Goldstein wrote: > > > On Sun, Jul 3, 2022 at 12:48 AM Bart Van Assche <bvanassche@xxxxxxx> wrote: > > > > > > > > On 5/18/22 20:07, Luis Chamberlain wrote: > > > > > I've been promoting the idea that running fstests once is nice, > > > > > but things get interesting if you try to run fstests multiple > > > > > times until a failure is found. It turns out at least kdevops has > > > > > found tests which fail with a failure rate of typically 1/2 to > > > > > 1/30 average failure rate. That is 1/2 means a failure can happen > > > > > 50% of the time, whereas 1/30 means it takes 30 runs to find the > > > > > failure. > > > > > > > > > > I have tried my best to annotate failure rates when I know what > > > > > they might be on the test expunge list, as an example: > > > > > > > > > > workflows/fstests/expunges/5.17.0-rc7/xfs/unassigned/xfs_reflink.txt:generic/530 # failure rate about 1/15 https://gist.github.com/mcgrof/4129074db592c170e6bf748aa11d783d > > > > > > > > > > The term "failure rate 1/15" is 16 characters long, so I'd like > > > > > to propose to standardize a way to represent this. How about > > > > > > > > > > generic/530 # F:1/15 > > > > > > > > > > Then we could extend the definition. F being current estimate, and this > > > > > can be just how long it took to find the first failure. A more valuable > > > > > figure would be failure rate avarage, so running the test multiple > > > > > times, say 10, to see what the failure rate is and then averaging the > > > > > failure out. So this could be a more accurate representation. For this > > > > > how about: > > > > > > > > > > generic/530 # FA:1/15 > > > > > > > > > > This would mean on average there failure rate has been found to be about > > > > > 1/15, and this was determined based on 10 runs. > > > > > > > > > > We should also go extend check for fstests/blktests to run a test > > > > > until a failure is found and report back the number of successes. > > > > > > > > > > Thoughts? > > > > > > > > > > Note: yes failure rates lower than 1/100 do exist but they are rare > > > > > creatures. I love them though as my experience shows so far that they > > > > > uncover hidden bones in the closet, and they they make take months and > > > > > a lot of eyeballs to resolve. > > > > > > > > I strongly disagree with annotating tests with failure rates. My opinion > > > > is that on a given test setup a test either should pass 100% of the time > > > > or fail 100% of the time. If a test passes in one run and fails in > > > > another run that either indicates a bug in the test or a bug in the > > > > software that is being tested. Examples of behaviors that can cause > > > > tests to behave unpredictably are use-after-free bugs and race > > > > conditions. How likely it is to trigger such behavior depends on a > > > > number of factors. This could even depend on external factors like which > > > > network packets are received from other systems. I do not expect that > > > > flaky tests have an exact failure rate. Hence my opinion that flaky > > > > tests are not useful and also that it is not useful to annotate flaky > > > > tests with a failure rate. If a test is flaky I think that the root > > > > cause of the flakiness must be determined and fixed. > > > > > > > > > > That is true for some use cases, but unfortunately, the flaky > > > fstests are way too valuable and too hard to replace or improve, > > > so practically, fs developers have to run them, but not everyone does. > > > > Everyone *should* be running them. They find *new bugs*, and it > > doesn't matter how old the kernel is. e.g. if you're backporting XFS > > log changes and you aren't running the "flakey" recoveryloop group > > tests, then you are *not testing failure handling log recovery > > sufficiently*. > > > > Where do you draw the line? recvoeryloop tests that shutdown and > > recover the filesystem will find bugs, they are guaranteed to be > > flakey, and they are *absolutely necessary* to be run because they > > are the only tests that exercise that critical filesysetm crash > > recovery functionality. > > > > What do we actually gain by excluding these "non-deterministic" > > tests from automated QA environments? > > > > Automated QA environment is a broad term. > We all have automated QA environments. > But there is a specific class of automated test env, such as CI build bots > that do not tolerate human intervention. Those environments need curated test lists because of the fact that failures gate progress. They also tend to run in resource limited environments as fstests is not the only set of tests that are run. Hence, generally speaking, CI is not an environment you'd be running a full "auto" group set of tests. Even the 'quick' group (which take an hour to run here) is often far too time and resource intensive for a CI system to use effectively. IOWs, we can't easily curate a set of tests that are appropriate for all CI environments - it's up to the people running the CI enviroment to determine what level of testing is appropriate for gating commits to their source tree, not the fstests maintainers or developers... > Is it enough to run only the deterministic tests to validate xfs code? > No it is not. > > My LTS environment is human monitored - I look at every failure and > analyse the logs and look at historic data to decide if they are regressions > or not. A bot simply cannot do that. > The bot can go back and run the test N times on baseline vs patch. > > The question is, do we want kernel test bot to run -g auto -x soak > on linux-next and report issues to us? > > I think the answer to this question should be yes. > > Do we want kernel test bot to run -g auto and report flaky test > failures to us? > > I am pretty sure that the answer is no. My answer to both is *yes, absolutely*. The zero-day kernel test bot runs all sorts of non-deterministic tests, including performance regression testing. We want these flakey/non-deterministic tests run in such environments, because they are often configurations we do not ahve access to and/or would never even consider. e.g. 128p server with a single HDD running IO scalability tests like AIM7... This is exactly where such automated testing provides developers with added value - it covers both hardware and software configs that indvidual developers cannot exercise themselves. Developers may or may not pay attention to those results depending on the test that "fails" and the hardware it "failed' on, but the point is that it got tested on something we'd never get coverage on otherwise. > > The test run policy mechanisms we already have avoid > > this whole can of worms - we don't need to care about the specific > > test requirements of any specific test enviroment because the > > default is inclusive and it is trivial to exclude tests from that > > default set if needed. > > > > I had the humble notion that we should make running fstests to > passing-by developers as easy as possible, because I have had the > chance to get feedback from some developers on their first time > attempt to run fstests and it wasn't pleasant, but nevermind. > -g auto -x soak is fine. I think that the way to do this is the way Ted has described - wrap fstests in an environment where all the required knowledge is already encapsulated and the "drive by testers" just need to crank the handle and it churns out results. As it is, I don't think making things easier for "drive-by" testing at the expense of making things arbitrarily different and/or harder for people who use it every day is a good trade-off. The "target market" for fstests is *filesystem developers* and people who spend their working life *testing filesystems*. The number of people who do *useful* "drive-by" testing of filesystems is pretty damn small, and IMO that niche is nowhere near as important as making things better for the people who use fstests every day.... > When you think about it, many fs developrs run ./check -g auto, > so we should not interfere with that, but I bet very few run './check'? > so we could make the default for './check' some group combination > that is as deterministic as possible. Bare ./check invocations are designed to run every test, regardless of what group they are in. Stop trying to redefine longstanding existing behaviour - if you want to define "deterministic" tests so that you can run just those tests, define a group for it, add all the tests to it, and then document it in the README as "if you have no experience with fstests, this is where you should start". Good luck keeping that up to date, though, as you're now back to the same problem that Ted describes, which is the "deterministic" group changes based on kernel, filesystem, config, etc. -Dave. -- Dave Chinner david@xxxxxxxxxxxxx