On Sat, Jun 25, 2022 at 12:35:50PM -0700, Luis Chamberlain wrote:
>
> The way the expunge list is processed could simply be modified in kdevops
> so that non-deterministic tests are not expunged but also not treated as
> fatal at the end. But think about it, the exception is if the non-deterministic
> failure does not lead to a crash, no?

That's what I'm doing today, but once we have a better test analysis
system, I think the only things which should be excluded are:

a) bugs which cause the kernel to crash
b) test bugs
c) tests which take ***forever*** for a particular configuration
   (and for which we probably get enough coverage through other
   configs)

If we have a non-deterministic failure which is due to a kernel bug,
I don't see any reason why we should skip the test.  We just need to
have a fully-featured enough test results analyzer so that we can
distinguish between known failures, known flaky failures, and new test
regressions.

So for example, the new tests generic/681, generic/682, and
generic/692 are causing deterministic failures for the ext4/encrypt
config.  Right now, this is being tracked manually in a flat text
file:

generic/68[12] encrypt   Failure percentage: 100%
    The directory does grow, but blocks aren't charged to either root or
    the non-privileged users' quota.  So this appears to be a real bug.
    Testing shows this goes all the way back to at least 4.14.

It's currently not tagged by kernel version, because I mostly only
care about upstream.  So once it's fixed upstream, I stop caring about
it.  In the ideal world, we'd track the kernel commit which fixed the
test failure, and when the fix propagated to the various stable
kernels, etc.

I've also resisted putting it in an expunge file, since if I did, I
would ignore it forever.  If it stays in my face, I'm more likely to
fix it, even if it's on my personal time.

> Here's the thing though. Not all developers have incentives to share.
Part of this is the amount of *time* that it takes to share this
information.  Right now, a lot of sharing takes place on the weekly
ext4 conference call.  It doesn't take Eric Whitney a lot of time to
mention that he's seeing a particular test failure, and I can quickly
search my test summary Unix mbox file and say, "yep, I've seen this
fail a couple of times before, starting in February 2020 --- but it's
super rare."

And since Darrick attends the weekly ext4 video chats, once or twice
we've asked him about some test failures on some esoteric xfs config,
such as realtime with an external logdev, and he might say, "oh yeah,
that's a known test bug.  pull this branch from my public xfstests
tree, I just haven't had time to push those fixes upstream yet."

(And I don't blame him for that; I just recently pushed some ext4 test
bug fixes, some of which I had initially sent to the list in late
April --- but on code review, changes were requested, and I just
didn't have *time* to clean up fixes in response to the code reviews.
So the fix which was good enough to suppress the failures sat in my
tree, but didn't go upstream since it was deemed not ready for
upstream.  I'm all for decreasing tech debt in xfstests; but do
understand that sometimes this means fixes to known test bugs will
stay in developers' git trees, since we're all overloaded.)

It's a similar problem with test failures.  Simply reporting a test
failure isn't *that* hard.  But the analysis, even if it's something
like:

generic/68[12] encrypt   Failure percentage: 100%
    The directory does grow, but blocks aren't charged to either root or
    the non-privileged users' quota.....

... is the critical bit that people *really* want, and it takes real
developer time to come up with that kind of information.  In the ideal
world, I'd have an army of trained minions to run down this kind of
stuff.  In the real world, sometimes this stuff happens after
midnight, local time, on a Friday night.
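
To make the flat-file entry above a bit more concrete, here is a
minimal Python sketch of how such a file might feed a results analyzer
that separates known failures, known flaky failures, and new test
regressions.  Everything in it is an assumption for illustration: the
header layout is inferred only from the single entry shown, the names
KnownFailure, parse_known_failures, and classify_failure are made up,
and so is the rule that 100% means "known failure" while anything
lower means "known flaky failure".  It is not how any existing tool
works.

```python
import re
from fnmatch import fnmatchcase
from typing import List, NamedTuple


class KnownFailure(NamedTuple):
    test_glob: str   # e.g. "generic/68[12]", treated as a shell-style glob
    config: str      # e.g. "encrypt"
    fail_pct: float  # failure percentage recorded in the file
    notes: str       # free-form analysis text from the indented lines


# Assumed header layout: "<test glob> <config> Failure percentage: <N>%"
HEADER = re.compile(r"^(\S+)\s+(\S+)\s+Failure percentage:\s*([\d.]+)%\s*$")

# The one entry quoted in this message.
SAMPLE = """\
generic/68[12] encrypt   Failure percentage: 100%
    The directory does grow, but blocks aren't charged to either root or
    the non-privileged users' quota.  So this appears to be a real bug.
    Testing shows this goes all the way back to at least 4.14.
"""


def parse_known_failures(text: str) -> List[KnownFailure]:
    """Parse header lines, attaching indented lines to the entry as notes."""
    entries: List[KnownFailure] = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            entries.append(
                KnownFailure(m.group(1), m.group(2), float(m.group(3)), ""))
        elif line.strip() and entries and line[0].isspace():
            last = entries[-1]
            notes = (last.notes + " " + line.strip()).strip()
            entries[-1] = last._replace(notes=notes)
    return entries


def classify_failure(test: str, config: str,
                     entries: List[KnownFailure]) -> str:
    """Classify a failing test against the known-failure entries."""
    for e in entries:
        if e.config == config and fnmatchcase(test, e.test_glob):
            # Assumed rule: 100% is deterministic, anything lower is flaky.
            if e.fail_pct >= 100.0:
                return "known failure"
            return "known flaky failure"
    return "new regression"


if __name__ == "__main__":
    known = parse_known_failures(SAMPLE)
    for name in ("generic/681", "generic/682", "generic/692"):
        print(name, "->", classify_failure(name, "encrypt", known))
```

Note that the glob only covers generic/681 and generic/682, so with
this one entry a generic/692 failure would be reported as new until
someone writes up an entry for it.  The parsing is the easy part; the
indented analysis text is what costs developer time.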

(Note that Android and Chrome OS, both of which are big users of
fscrypt, don't use quota.  So if I were to open a bug tracker entry on
it, the bug would get prioritized to P2 or P3, and never be heard from
again, since there's no business reason to prioritize fixing it.
Which is why some of this happens on personal time.)

						- Ted