On Sat, Jul 02, 2022 at 02:48:12PM -0700, Bart Van Assche wrote:
> 
> I strongly disagree with annotating tests with failure rates. My opinion is
> that on a given test setup a test either should pass 100% of the time or
> fail 100% of the time.

My opinion is also that no child should ever go to bed hungry, and we should end world hunger. However, meanwhile, in the real world, while we can *strive* to eliminate all flaky tests, whether they are caused by buggy tests or buggy kernel code, there's an old saying that the only time code is bug-free is when it is no longer being used.

That being said, I completely agree that annotating failure rates in xfstests-dev upstream probably doesn't make much sense. As we've stated before, flakiness is highly dependent on the hardware configuration and the kernel version (remember, sometimes flaky tests are caused by bugs in other kernel subsystems --- including the loop device, which has not historically been bug-free(tm) either, and so bugs come and go across the entire kernel surface).

I believe the best way to handle this is to have better test results analysis tools. We can certainly consider having some shared test results database, but I'm not convinced that flat text files shared via git are sufficiently scalable.

The final thing I'll note is that we've lived with low-probability flakes for a very long time, and it hasn't been the end of the world. Sometime in 2011 or 2012, when I first started at Google and when we first started rolling out ext4 to all of our data centers, once or twice a month --- across the entire world-wide fleet --- there would be an unexplained file system corruption that had remarkably similar characteristics. It took us several months to run it down, and it turned out to be a lock getting released one C statement too soon. When I did some further archeological research, it turned out it had been in upstream for well over a *decade* --- in ext3 and ext4 --- and had not been noticed in at least 3 or 4 enterprise distro GA testing/qualification cycles. Or rather, it might have been noticed, but since it couldn't be replicated, I'm guessing the QA testers shrugged, assumed that it *must* have been due to some cosmic ray, or some such, and moved on.

> If a test is flaky I think that the root cause of the flakiness must
> be determined and fixed.

In the ideal world, sure. Then again, in the ideal world, we wouldn't have thousands of people getting killed over border disputes and because some maniacal world leader thinks that it's A-OK to overrun the borders of adjacent countries. However, until we have infinite resources available to us, the reality is that we need to live with the fact that life is imperfect, despite all of our efforts to reduce these sorts of flaky tests --- especially when we're talking about esoteric test configurations that most users won't be using. (Or when they are triggered by test code that is not used in production, but for which the error injection or shutdown simulation code is itself not perfect.)

Cheers,

						- Ted
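
P.S. To make "better test results analysis tools" slightly more concrete, here is a minimal sketch of a flake-rate aggregator. The input format --- one "<run-id> <test-name> <pass|fail>" record per line --- is purely hypothetical for illustration; it isn't anything xfstests emits today, and a real tool would pull from whatever shared results database we eventually settle on.

#!/usr/bin/env python3
# Sketch: aggregate per-test flake rates across many test runs.
# Assumes a hypothetical flat record format, one line per test result:
#     <run-id> <test-name> <pass|fail>
# e.g. "2022-07-02-a generic/475 fail"
import sys
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])   # test -> [passes, failures]

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 3:
        continue                        # skip malformed records
    _run_id, test, status = fields
    if status == "pass":
        counts[test][0] += 1
    elif status == "fail":
        counts[test][1] += 1

# Report tests that neither always pass nor always fail, i.e. the flaky ones.
for test, (passes, fails) in sorted(counts.items()):
    total = passes + fails
    if passes and fails:
        print(f"{test}: {fails}/{total} runs failed "
              f"({100.0 * fails / total:.1f}%)")

Fed the accumulated records from many runs on a given test setup (say, "cat results/*.flat | ./flake-rate.py", with both names made up for this example), it reports only the tests that neither always pass nor always fail there --- which is exactly the set worth triaging.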