Re: [RFC: kdevops] Standardizing on failure rate nomenclature for expunges

On Wed, Jul 6, 2022 at 5:30 PM Theodore Ts'o <tytso@xxxxxxx> wrote:
>
> On Wed, Jul 06, 2022 at 01:11:16PM +0300, Amir Goldstein wrote:
> >
> > So I am wondering what the status is today, because I rarely see
> > fstests failure reports from the kernel test bot on the list,
> > though there are some reports.
> >
> > Does anybody have a clue which hw/fs/config/group of fstests the
> > kernel test bot is running on linux-next?
>
> The zero-day test bot only reports test regressions.  So they have
> some list of tests that have failed in the past, and they only
> report *new* test failures.  This is not just true for fstests; it's
> also true for things like check warnings and compiler warnings ---
> and I suspect it was those sorts of reports that caused the zero-day
> bot to keep state and to filter out known test failures and/or check
> warnings and/or compiler warnings, so that only new test failures
> and/or new compiler warnings are reported.  If they didn't, they
> would be spamming kernel developers, and given how.... "kind and
> understanding" kernel developers are about getting spammed,
> especially when some of the complaints are bogus ones (either test
> bugs or compiler bugs), my guess is that they did the filtering out
> of sheer self-defense.  It certainly wasn't something requested by a
> file system developer as far as I know.
>
>
> So this is how I think an automated system for "drive-by testers"
> should work.  First, the tester would specify the baseline/origin
> tag, and the testing system would run the tests on the baseline
> once.  Hopefully, the test runner already has exclude files, so that
> kernel bugs which cause an immediate kernel crash or deadlock would
> already be in the exclude list.  But as I discovered this weekend,
> for file systems that I haven't tried in a few years, like udf or
> ubifs, there may be tests missing from the exclude list that cause
> the test VM to stop responding and/or crash.
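
Just to check my understanding of this first step: something like the
rough sketch below (illustrative Python, not actual xfstests-bld code;
it assumes the "Failures:" summary line that fstests' ./check prints
and its -E exclude-file option)?

import json
import subprocess

def run_check(exclude_file):
    """Run ./check once; return the set of failing tests."""
    proc = subprocess.run(
        ["./check", "-g", "auto", "-E", exclude_file],
        capture_output=True, text=True)
    for line in proc.stdout.splitlines():
        if line.startswith("Failures:"):
            return set(line.split()[1:])
    return set()

if __name__ == "__main__":
    # Run once on the baseline and save the failure state for later
    # comparison against the drive-by developer's changes.
    failures = run_check("exclude.baseline")
    with open("baseline-state.json", "w") as f:
        json.dump(sorted(failures), f)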
>
> I have a planned improvement where, if you are using gce-xfstests's
> lightweight test manager, a deadlock can be detected, since the LTM
> is constantly reading the serial console, and the LTM can restart
> the VM.  The VM can then disambiguate between a forced reboot caused
> by the LTM and a forced shutdown caused by the use of a preemptible
> VM (a planned feature, not yet fully implemented), and the test
> runner can skip the tests already run as well as the test which
> caused the crash or deadlock.  This could be reported so that
> eventually the test could be added to the exclude file, to benefit
> those people who are using kvm-xfstests.  (This is an example of a
> planned improvement in xfstests-bld; if someone is interested in
> helping to implement it, they should give me a ring.)
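
The console-watching piece sounds easy enough to prototype.  A
hand-wavy sketch of what I imagine (all names made up, not the actual
LTM code):

import sys
import time

HANG_PATTERNS = ("INFO: task", "soft lockup", "hard LOCKUP")
SILENCE_TIMEOUT = 600      # seconds of console silence before a reset

def reset_vm(reason):
    # A real LTM would also record *why* it reset the guest, so that
    # after reboot the runner can tell an LTM-forced reset from a
    # preemption shutdown, resume where it left off, and skip the
    # test that was running when the machine hung.
    print("resetting VM:", reason, file=sys.stderr)

def watch_console(read_line):
    """read_line() returns the next console line, or "" on timeout."""
    last_output = time.time()
    while True:
        line = read_line()
        if line:
            last_output = time.time()
            if any(p in line for p in HANG_PATTERNS):
                reset_vm("hang reported on console: " + line.strip())
        elif time.time() - last_output > SILENCE_TIMEOUT:
            reset_vm("no console output for %ds" % SILENCE_TIMEOUT)
            last_output = time.time()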
>
> Once the tests which fail on a particular baseline are known, this
> state gets saved, and the tests can then be run on the drive-by
> developer's changes.  We can now compare the known failures for the
> baseline with those on the changed kernel, and if there are any new
> failures, there are two possibilities: (a) this is a new failure
> caused by the drive-by developer's changes, or (b) this is a
> pre-existing known flake.
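
So the comparison step is essentially a set difference over the saved
state, with the new failures feeding the rerun logic below, e.g.
(made-up test names):

baseline_failures = {"generic/475", "xfs/501"}
patched_failures = {"generic/475", "generic/631"}

# Anything failing on the patched kernel but not on the baseline is a
# candidate: either a real regression or a flake that happened not to
# fire during the single baseline run.
candidates = patched_failures - baseline_failures
print("needs triage:", sorted(candidates))    # ['generic/631']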
>
> To disambiguate between these two cases, we now run the failed test
> N times on the changed kernel (where N is probably something like
> 10-50; I normally use 25) and get the failure rate.  If the failure
> rate is 100%, then this is almost certainly (a).  If the failure
> rate is < 100% (and greater than 0%), then we need to rerun the
> failed test on the baseline kernel N times as well; if the failure
> rate there is 0%, then we should do a bisection search to determine
> the guilty commit.
>
> If the failure rate on the changed kernel is 0%, then this is either
> an extremely rare flake, in which case we might need to increase N
> --- or it's an example of a test failure which is sensitive to the
> order in which tests are run, in which case we may need to rerun all
> of the tests in order up to the failed test.
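
If I read the above correctly, the whole decision procedure fits in a
few lines.  A sketch (the rerun() runner is stubbed out, and all the
names are mine):

def rerun(test, kernel, n):
    """Run one test n times on the given kernel; return the failure
    rate in [0, 1].  Stubbed here -- a real runner would boot the
    kernel and invoke ./check for each iteration."""
    raise NotImplementedError

def triage(test, n=25):
    rate_patched = rerun(test, "patched", n)
    if rate_patched == 1.0:
        return "(a) regression: almost certainly the new changes"
    if rate_patched == 0.0:
        # Did not reproduce in isolation: an extremely rare flake
        # (increase n), or order-dependent -- rerun all tests in
        # order up to the failed one.
        return "rare or order-dependent flake"
    rate_baseline = rerun(test, "baseline", n)
    if rate_baseline == 0.0:
        return "bisect to find the guilty commit"
    return "(b) pre-existing known flake"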
>
> This is what I do right now when processing patches for upstream.
> It's also rather similar to what we're doing for the XFS stable
> backports, because it's much more efficient than running the
> baseline tests 100 times (which can take a week of continuous
> testing, per Luis's comments) --- we only run a test dozens (or
> more) times where a potential flake has been found, as opposed to
> running *all* tests that many times.  It's all done manually, but it
> would be great if we could automate this to make life easier for XFS
> stable backporters, and *also* for drive-by developers.
>

This process sounds like it could get us to mostly unattended
regression testing, which would be great.

I do wonder whether there is more that fstests developers could do to
assist, by annotating new (and existing) tests to aid in that effort.

For example, there might be a case for tagging a test as "this is a
very reliable test that should have no failures at all - if there is
a failure then something is surely wrong".
I wonder whether it would help to have a group like that, and how
many tests such a group would include.  A sketch of what I have in
mind follows below.
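
Something like this, where "reliable" is a hypothetical group name
and the parsing merely illustrates the idea (fstests keeps
per-directory group.list files mapping each test to its groups):

from pathlib import Path

def tests_in_group(group, tests_dir="tests"):
    """Collect "<dir>/<seq>" names for every test in the group."""
    found = set()
    for group_list in Path(tests_dir).glob("*/group.list"):
        for line in group_list.read_text().splitlines():
            fields = line.split()
            if fields and not fields[0].startswith("#") \
                      and group in fields[1:]:
                found.add(group_list.parent.name + "/" + fields[0])
    return found

# A failure of any test in such a group could short-circuit the
# triage: no N reruns needed, report a hard regression immediately.
reliable = tests_in_group("reliable")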

> And again, if anyone is interested in helping with this, especially if
> you're familiar with shell, python 3, and/or the Go language, please
> contact me off-line.
>

Please keep me in the loop; if you have a prototype, I may be able to
help test it.

Thanks,
Amir.


