On Wed, Jul 6, 2022 at 5:30 PM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote:
> >
> > So I am wondering what is the status today, because I rarely
> > see fstests failure reports from kernel test bot on the list, but there
> > are some reports.
> >
> > Does anybody have a clue what hw/fs/config/group of fstests
> > kernel test bot is running on linux-next?
>
> The zero-day test bot only reports test regressions.  So they have some
> list of tests that have failed in the past, and they only report *new*
> test failures.  This is not just true for fstests, but it's also true
> for things like check warnings and compiler warnings --- and I suspect
> it's those sorts of reports that caused the zero-day bot to keep
> state, and to filter out test failures and/or check warnings and/or
> compiler warnings, so that only new test failures and/or new compiler
> warnings are reported.  If they didn't, they would be spamming kernel
> developers, and given how.... "kind and understanding" kernel
> developers are at getting spammed, especially when sometimes the
> complaints are bogus ones (either test bugs or compiler bugs), my
> guess is that they did the filtering out of sheer self-defense.  It
> certainly wasn't something requested by a file system developer as far
> as I know.
>
> So this is how I think an automated system for "drive-by testers"
> should work.  First, the tester would specify the baseline/origin tag,
> and the testing system would run the tests on the baseline once.
> Hopefully, the test runner already has exclude files so that kernel
> bugs that cause an immediate kernel crash or deadlock would already
> be in the exclude list.  But as I've discovered this weekend, for file
> systems that I haven't tried in a few years, like udf or ubifs, there
> may be tests missing from the exclude list that cause the test VM to
> stop responding and/or crash.
>
> I have a planned improvement where, if you are using gce-xfstests's
> lightweight test manager, since the LTM is constantly reading the
> serial console, a deadlock can be detected and the LTM can restart the
> VM.  The VM can then disambiguate between a forced reboot caused by the
> LTM and a forced shutdown caused by the use of a preemptible VM (a
> planned feature not yet fully implemented), and the test runner
> can skip the tests already run, and skip the test which caused the
> crash or deadlock, and this could be reported so that eventually the
> test could be added to the exclude file to benefit those people who
> are using kvm-xfstests.  (This is an example of a planned improvement
> in xfstests-bld; if someone is interested in helping to implement
> it, they should give me a ring.)
>
> Once the tests which are failing given a particular baseline are
> known, this state would get saved, and now the tests can be run on
> the drive-by developer's changes.  We can then compare the known
> failures for the baseline with the changed kernel, and if there are
> any new failures, there are two possibilities: (a) this was a new
> failure caused by the drive-by developer's changes, or (b) this was
> a pre-existing known flake.
>
> To disambiguate between these two cases, we now run the failed test N
> times (where N is probably something like 10-50 times; I normally use
> 25 times) on the changed kernel, and get the failure rate.  If the
> failure rate is 100%, then this is almost certainly (a).
> If the failure rate is < 100% (and greater than 0%), then we need to
> rerun the failed test on the baseline kernel N times; if the failure
> rate on the baseline is 0%, then we should do a bisection search to
> determine the guilty commit.
>
> If the failure rate on the changed kernel is 0%, then this is either
> an extremely rare flake, in which case we might need to increase N ---
> or it's an example of a test failure which is sensitive to the order
> of the tests which are run, in which case we may need to rerun all of
> the tests in order up to the failed test.
>
> This is right now what I do when processing patches for upstream.
> It's also rather similar to what we're doing for the XFS stable
> backports, because it's much more efficient than running the baseline
> tests 100 times (which can take a week of continuous testing, per
> Luis's comments) --- we only run tests dozens (or more) of times where
> a potential flake has been found, as opposed to *all* tests.  It's all
> done manually, but it would be great if we could automate this to make
> life easier for XFS stable backporters, and *also* for drive-by
> developers.
>

This process sounds like it could get us to mostly unattended
regression testing, so it sounds good.

I do wonder whether there is more that fstests developers can do to
assist, e.g. by annotating new (and existing) tests to aid in that
effort.  For example, there might be a case for tagging a test as
"this is a very reliable test that should have no failures at all -
if there is a failure then something is surely wrong".  I wonder if
it would help to have a group like that and how many tests that
group would include.

> And again, if anyone is interested in helping with this, especially if
> you're familiar with shell, python 3, and/or the Go language, please
> contact me off-line.
>

Please keep me in the loop if you have a prototype; I may be able to
help test it.

Thanks,
Amir.
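
Below is a minimal Python sketch of the rerun-and-compare flow Ted
describes above, just to make the decision points concrete.
run_test(), start_bisection(), failure_rate() and
classify_new_failure() are hypothetical names, not part of fstests or
xfstests-bld; only the value of N and the control flow are taken from
the description in the mail.

# Sketch of the flake-disambiguation step; the helpers below are placeholders.

N = 25  # reruns per suspect test ("something like 10-50 times")

def run_test(kernel, test):
    # Placeholder: boot `kernel` (e.g. under kvm-xfstests) and run a single
    # fstests case, returning True if it passed.  Deliberately left abstract.
    raise NotImplementedError

def start_bisection(test):
    # Placeholder: kick off a bisection between the baseline and the changes.
    raise NotImplementedError

def failure_rate(kernel, test, runs=N):
    # Rerun one test `runs` times and return the observed failure fraction.
    failures = sum(1 for _ in range(runs) if not run_test(kernel, test))
    return failures / runs

def classify_new_failure(test, baseline_kernel, changed_kernel):
    # `test` failed on the changed kernel but is not a known baseline failure.
    rate = failure_rate(changed_kernel, test)
    if rate == 1.0:
        return "regression"            # fails every time: caused by the changes
    if rate == 0.0:
        # Reruns do not reproduce it: a very rare flake (consider a larger N)
        # or an ordering-dependent failure; rerun the whole sequence up to
        # the failed test to tell the two apart.
        return "rare-flake-or-order-dependent"
    if failure_rate(baseline_kernel, test) == 0.0:
        start_bisection(test)          # flaky only with the changes applied
        return "new-flaky-failure"
    return "pre-existing-flake"        # flaky on the baseline as well

In a real runner the classification would presumably also be fed back
into the saved baseline state, so a pre-existing flake only has to be
characterized once.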