Re: sharing fstests results (Was: [PATCH 5.15 CANDIDATE v2 0/8] xfs stable candidate patches for 5.15.y (part 1))

"Theodore Ts'o" <tytso@xxxxxxx> · Sat, 25 Jun 2022 17:50:26 -0400

On Sat, Jun 25, 2022 at 12:35:50PM -0700, Luis Chamberlain wrote:
> 
> The way the expunge list is process could simply be modified in kdevops
> so that non-deterministic tests are not expunged but also not treated as
> fatal at the end. But think about it, the exception is if the non-deterministic
> failure does not lead to a crash, no?

That's what I'm doing today, but once we have a better test analysis
system, what I think the only thing which should be excluded is:

   a)   bugs which cause the kernel to crash
   b)   test bugs
   c)   tests which take ***forever*** for a particular configuration
   	    (and for which  we probably get enough coverage through
	    other configs)

If we have a non-deterministic failure, which is due to a kernel bug,
I don't see any reason why we should skip the test.  We just need to
have a fully-featured enough test results analyzer so that we can
distinguish between known failures, known flaky failures, and new test
regressions.

So for example, the new tests generic/681, generic/682, and
generic/692 are causing determinsitic failures for the ext4/encrypt
config.  Right now, this is being tracked manually in a flat text
file:

generic/68[12]  encrypt   Failure percentage: 100%
    The directory does grow, but blocks aren't charged to either root or
    the non-privileged users' quota.  So this appears to be a real bug.
    Testing shows this goes all the way back to at least 4.14.

It's currently not tagged by kernel version, because I mostly only
care about upstream.  So once it's fixed upstream, I stop caring about
it.  In the ideal world, we'd track the kernel commit which fixed the
test failure, and when the fix propagated to the various stable
kernels, etc.

I've also resisted putting it in an expunge file, since if it did, I
would ignore it forever.  If it stays in my face, I'm more likely to
fix it, even if it's on my personal time.

> Here's the thing though. Not all developers have incentives to share.

Part of this is the amount of *time* that it takes to share this
information.  Right now, a lot of sharing takes place on the weekly
ext4 conference call.  It doesn't take Eric Whitney a lot of time to
mention that he's seeing a particular test failure, and I can quickly
search my test summary Unix mbox file and say, "yep, I've seen this
fail a couple of times before, starting in February 2020 --- but it's
super rare."

And since Darrick attends the weekly ext4 video chats, once or twice
we've asked him about some test failures on some esoteric xfs config,
such as realtime with an external logdev, and he might say, "oh yeah,
that's a known test bug.  pull this branch from my public xfstests
tree, I just haven't had time to push those fixes upstream yet."

(And I don't blame him for that; I just recently pushed some ext4 test
bug fixes, some of which I had initially sent to the list in late
April --- but on code review, changes were requested, and I just
didn't have *time* to clean up fixes in response to the code reviews.
So the fix which was good enough to suppress the failures sat in my
tree, but didn't go upstream since it was deemed not ready for
upstream.  I'm all for decreasing tech debt in xfstests; but do
understand that sometimes this means fixes to known test bugs will
stay in developers' git trees, since we're all overloaded.)

It's a similar problem with test failures.  Simply reporting a test
failure isn't *that* hard.  But the analysis, even if it's something
like:

generic/68[12]  encrypt   Failure percentage: 100%
    The directory does grow, but blocks aren't charged to either root or
    the non-privileged users' quota.....

... is the critical bit that people *really* want, and it takes real
developer time to come up with that kind of information.  In the ideal
world, I'd have an army of trained minions to run down this kind of
stuff.  In the real world, sometimes this stuff happens after
midnight, local time, on a Friday night.

(Note that Android and Chrome OS, both of which are big users of
fscrypt, don't use quota.  So If I were to open a bug tracker entry on
it, the bug would get prioritized to P2 or P3, and never be heard from
again, since there's no business reason to prioritize fixing it.
Which is why some of this happens on personal time.)

      	      	    	      	    	       - Ted