On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote:
>
> So I am wondering what is the status today, because I rarely
> see fstests failure reports from kernel test bot on the list, but there
> are some reports.
>
> Does anybody have a clue what hw/fs/config/group of fstests
> kernel test bot is running on linux-next?

The zero-day test bot only reports test regressions.  So they have some
list of tests that have failed in the past, and they only report *new*
test failures.  This is not just true for fstests, but it's also true
for things like check warnings and compiler warnings --- and I suspect
it's those sorts of reports that caused the zero-day bot to keep state,
and to filter out test failures and/or check warnings and/or compiler
warnings, so that only new test failures and/or new compiler warnings
are reported.

If they didn't, they would be spamming kernel developers, and given
how.... "kind and understanding" kernel developers are when they get
spammed, especially when some of the complaints are bogus (either test
bugs or compiler bugs), my guess is that they did the filtering out of
sheer self-defense.  It certainly wasn't something requested by a file
system developer as far as I know.

So this is how I think an automated system for "drive-by testers"
should work.  First, the tester would specify the baseline/origin tag,
and the testing system would run the tests on the baseline once.
Hopefully, the test runner already has exclude files, so that tests
which trigger kernel bugs causing an immediate kernel crash or deadlock
would already be in the exclude list.  But as I discovered this
weekend, for file systems that I haven't tried in a few years, like udf
or ubifs, etc., there may be tests missing from the exclude list that
cause the test VM to stop responding and/or crash.

I have a planned improvement where, if you are using gce-xfstests's
lightweight test manager (LTM), since the LTM is constantly reading the
serial console, a deadlock can be detected and the LTM can restart the
VM.  The VM can then disambiguate between a forced reboot caused by the
LTM and a forced shutdown caused by the use of a preemptible VM (a
planned feature not yet fully implemented), and the test runner can
skip the tests already run as well as the test which caused the crash
or deadlock.  This could be reported so that eventually the test could
be added to the exclude file, to benefit those people who are using
kvm-xfstests.  (This is an example of a planned improvement in
xfstests-bld; if someone is interested in helping to implement it, they
should give me a ring.)

Once the tests which are failing for a particular baseline are known,
this state would get saved, and then the tests can be run on the
drive-by developer's changes.  We can now compare the known failures
for the baseline with those for the changed kernel, and if there are
any new failures, there are two possibilities: (a) this was a new
failure caused by the drive-by developer's changes, or (b) this was a
pre-existing known flake.

To disambiguate between these two cases, we now run the failed test N
times (where N is probably something like 10-50; I normally use 25) on
the changed kernel, and get the failure rate.  If the failure rate is
100%, then this is almost certainly (a).  If the failure rate is < 100%
(and greater than 0%), then we need to rerun the failed test on the
baseline kernel N times as well; if the failure rate on the baseline is
0%, then we should do a bisection search to determine the guilty
commit.
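To make that a bit more concrete, here is a rough python 3 sketch of
the triage logic described above.  The names here (run_test(),
failure_rate(), triage(), known_baseline_failures) are hypothetical
stand-ins for whatever the test runner would actually provide; the
point is the shape of the decision tree, not a real implementation.

#!/usr/bin/python3
#
# Sketch only: run_test(kernel, test) is a stand-in for invoking the
# test runner (kvm-xfstests, gce-xfstests, ...) against a given kernel
# image, returning True if the test failed.

N = 25          # number of reruns used to estimate the failure rate

def run_test(kernel, test):
    raise NotImplementedError   # hook up to the real test runner here

def failure_rate(kernel, test, n=N):
    failures = sum(1 for _ in range(n) if run_test(kernel, test))
    return failures / n

def triage(baseline, changed, test, known_baseline_failures):
    if test in known_baseline_failures:
        return "known baseline failure -- ignore"
    rate = failure_rate(changed, test)
    if rate == 1.0:
        return "hard failure -- almost certainly (a)"
    if rate > 0.0:
        if failure_rate(baseline, test) == 0.0:
            return "likely regression -- bisect for the guilty commit"
        return "pre-existing flake -- (b)"
    # rate == 0.0: the test only failed as part of the full run
    return ("very rare flake or order-dependent failure -- increase N "
            "or rerun all tests in order up to this one")

Obviously the real thing would also have to cope with tests that wedge
or crash the VM, and with saving state across reboots, which is where
the LTM work described above comes in.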
If the failure rate is 0%, then this is either an extremely rare flake,
in which case we might need to increase N --- or it's an example of a
test failure which is sensitive to the order of tests which are run, in
which case we may need to rerun all of the tests in order up to the
failed test.

This is what I do right now when processing patches for upstream.  It's
also rather similar to what we're doing for the XFS stable backports,
because it's much more efficient than running the baseline tests 100
times (which can take a week of continuous testing, per Luis's
comments) --- we only rerun the tests where a potential flake has been
found dozens (or more) of times, as opposed to rerunning *all* of the
tests.  It's all done manually, but it would be great if we could
automate this to make life easier for XFS stable backporters, and
*also* for drive-by developers.

And again, if anyone is interested in helping with this, especially if
you're familiar with shell, python 3, and/or the Go language, please
contact me off-line.

Cheers,

				- Ted