On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote:
> 
> I strongly disagree with annotating tests with failure rates. My opinion is
> that on a given test setup a test either should pass 100% of the time or
> fail 100% of the time.

My opinion is also that no child should ever go to bed hungry, and we should end world hunger. However, meanwhile, in the real world, while we can *strive* to eliminate all flaky tests, whether they are caused by buggy tests or buggy kernel code, there's an old saying that the only time code is bug-free is when it is no longer being used.

That being said, I completely agree that annotating failure rates in xfstests-dev upstream probably doesn't make much sense. As we've stated before, flakiness is highly dependent on the hardware configuration and the kernel version (remember, sometimes flaky tests are caused by bugs in other kernel subsystems --- including the loop device, which has not historically been bug-free(tm) either, and so bugs come and go across the entire kernel surface).

I believe the best way to handle this is to have better test results analysis tools. We can certainly consider having some shared test results database, but I'm not convinced that flat text files shared via git are sufficiently scalable.

The final thing I'll note is that we've lived with low-probability flakes for a very long time, and it hasn't been the end of the world. Sometime in 2011 or 2012, when I first started at Google and when we first started rolling out ext4 to all of our data centers, once or twice a month --- across the entire world-wide fleet --- there would be an unexplained file system corruption that had remarkably similar characteristics. It took us several months to run it down, and it turned out to be a lock getting released one C statement too soon. When I did some further archeological research, it turned out it had been in upstream for well over a *decade* --- in ext3 and ext4 --- and had not been noticed in at least 3 or 4 enterprise distro GA testing/qualification cycles. Or rather, it might have been noticed, but since it couldn't be replicated, I'm guessing the QA testers shrugged, assumed that it *must* have been due to some cosmic ray, or some such, and moved on.

> If a test is flaky I think that the root cause of the flakiness must
> be determined and fixed.

In the ideal world, sure. Then again, in the ideal world, we wouldn't have thousands of people getting killed over border disputes and because some maniacal world leader thinks that it's A-OK to overrun the borders of adjacent countries. However, until we have infinite resources available to us, the reality is that we need to live with the fact that life is imperfect, despite all of our efforts to reduce these sorts of flaky tests --- especially when we're talking about esoteric test configurations that most users won't be using. (Or when they are triggered by test code that is not used in production, but for which the error injection or shutdown simulation code is itself not perfect.)

Cheers,

						- Ted
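
P.S. To make "better test results analysis tools" slightly more concrete, here is a minimal sketch of a flake-rate aggregator. The input format --- one "<run-id> <test-name> <pass|fail>" record per line --- is purely hypothetical for illustration; it isn't anything xfstests emits today, and a real tool would pull from whatever shared results database we eventually settle on.

#!/usr/bin/env python3
# Sketch: aggregate per-test flake rates across many test runs.
# Assumes a hypothetical flat record format, one line per test result:
#     <run-id> <test-name> <pass|fail>
# e.g. "2022-07-02-a generic/475 fail"
import sys
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])   # test -> [passes, failures]

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 3:
        continue                        # skip malformed records
    _run_id, test, status = fields
    if status == "pass":
        counts[test][0] += 1
    elif status == "fail":
        counts[test][1] += 1

# Report tests that neither always pass nor always fail, i.e. the flaky ones.
for test, (passes, fails) in sorted(counts.items()):
    total = passes + fails
    if passes and fails:
        print(f"{test}: {fails}/{total} runs failed "
              f"({100.0 * fails / total:.1f}%)")

Fed the accumulated records from many runs on a given test setup (say, "cat results/*.flat | ./flake-rate.py", with both names made up for this example), it reports only the tests that neither always pass nor always fail there --- which is exactly the set worth triaging.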