Re: [PATCH 2/2] diff tests: rewrite flakyness-causing test "aid"

On Thu, Apr 15 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
>> What I am saying is that it's incompatible to have:
>>
>>  1. Failing tests
>>  2. Not removing scratch files that would otherwise be removed
>>
>> And:
>>
>>  3. Knowing that the rest of the tests pass in the case of #1 without
>>     reading them all.
>>
>> Hence the suggestion that we should use test_when_finished without
>> exception for such patterns.
>
> I disagree with the above; t4013's "read magic cmd" part is designed
> to be independent from each other; I do not think you need to read
> all of the parts enclosed in << ... EOF to understand that.
>
> In short,
>
>  * "test_when_finished 'rm ...'" is a good tool to ensure something
>    gets removed no matter what else happens in the same test.  Since
>    it does not run the clean-up under "-i", it can even be used on
>    files that would be useful in diagnosing failures.  But under
>    "-d", it does clean-up to avoid affecting the following test.
>
>  * $actual was made unique so that even under "-d", a failing test
>    would not negatively affect the subsequent ones. [...]

I don't mind disagreeing with you, but I do feel like we're just talking
past each other; either that, or we've got a different understanding of
"negatively affect the subsequent ones".

I'm saying that yes, I agree that *right now* it doesn't negatively
affect any of the subsequent ones, but that you can only know that by
reading all the tests in the file past the one you're modifying.

I.e. imagine if we added something like this; it's obviously contrived,
but it's a stand-in for "we expect consistent state":
	
	diff --git a/t/t4013-diff-various.sh b/t/t4013-diff-various.sh
	index 6cca8b84a6b..a326968e8f3 100755
	--- a/t/t4013-diff-various.sh
	+++ b/t/t4013-diff-various.sh
	@@ -526,4 +526,9 @@ test_expect_success 'diff -I<regex>: detect malformed regex' '
	 	test_i18ngrep "invalid regex given to -I: " error
	 '
	 
	+test_expect_success 'stuff' '
	+	find * -type f >actual &&
	+	test_line_count = 9 actual
	+'
	+
	 test_done

Now run it and you'll pass them all; then break one of the earlier tests
with:

    >t4013/diff.log_-GF_-p_--pickaxe-all_master

You'll fail two tests (the expected one and the new one) instead of just
one, which is purely an artifact of the earlier test's cleanup / state
being different depending on whether it passes or not.

With my patch you'll fail just the one earlier test, because we do the
same cleanup no matter whether we failed midway through it or not.
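
To make that concrete, the shape I have in mind is roughly this (just a
sketch with made-up file and rev names, not the literal t4013 code):

	test_expect_success 'diff output matches' '
		# register the cleanup up front: it runs the same way
		# whether or not the test_cmp below fails, and test-lib
		# skips it under -i so the file is still there to debug
		test_when_finished "rm -f actual.tmp" &&
		git diff HEAD~ HEAD >actual.tmp &&
		test_cmp expect actual.tmp
	'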

So yes, while this small series amounts to re-arranging some deck chairs
on the Titanic, I do think not being able to rely on consistent state is
a big anti-pattern in the test suite in general.

It's a big part of the reason why you can't reliably use --run=N or
GIT_SKIP_TESTS (especially in cases of partial failures) to selectively
skip things without what amounts to a bespoke analysis of each test file
you'd like to run a subset of.
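
I.e. the sort of thing I'd like to be able to do reliably (the test
number here is made up) is:

	# run just one later test in isolation ...
	./t4013-diff-various.sh --run=123
	# ... or skip a known-bad one wherever the suite is run
	GIT_SKIP_TESTS='t4013.123' ./t4013-diff-various.sh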

Now, of course test_when_finished won't get us there by itself; we'd
also need to consistently use some "test_expect_setup" to clearly
delineate "setup" and other "this must run" state changes from "just
consumes existing state and resets". But consistently writing
new/modified tests as "just consumes existing state and resets" will
slowly get us there.

>    [...] Removing it for
>    failure cases is actively wrong, so use of test_when_finished,
>    which may be an expedient way to remove artifact that would
>    negatively affect later test pieces, does not apply --- existing
>    code is doing better than test_when_finished can offer, and
>    replacing it with test_when_finished is a regression.

Aside from whether we agree or disagree on whether test_when_finished
should be consistently used to clean up state, I genuinely still don't
see how the sum total of replacing the current "rm && cp" with
"test_when_finished && BUG" leaves us worse off.

Are you thinking of the case where someone wants to add N new lines to
the "while read magic cmd" loop?

That's the only case I can think of that would arguably be worse, but I
think it's far outweighed by the more obvious failure mode, and a "git
blame" on the lines it consumes shows that additions are very occasional
(and not in large batches).

Other than that, under e.g. -vixd having the test suite stop on the
specific line that failed (the test_cmp, not a "false" after some "cp")
is how most other tests work, and I think it makes things much more
readable and understandable at a glance.
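
I.e. the mode of running I have in mind is something like:

	# -v: verbose, -i: stop at the first failure, -x: trace the
	# commands as they run, -d: keep the trash directory around
	./t4013-diff-various.sh -v -i -x -d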

>  * And the most important part: the unnecessary removal of $actual
>    was not even part of the "flakyness-causing" bug you started this
>    topic to fix anyway.  We should just remove the regression caused
>    by unnecessary use of test_when_finished and focus on the "don't
>    place the actual output from a brand new test under the file used
>    for the expectation the next time---instead place it under
>    temporary file and call for attention" part, which was the real
>    improvement.

I see your point and agree in theory that the "we write into the
t/t4013/ directory" and "we inconsistently clear up the trash directory
on OK/NOK" are in principle different issues.

But in this case we always removed or copied "$actual" away from its
original name before, and we do the same now (just more reliably, and in
a way that doesn't cause flakiness in the "cp" case).

So I don't really see how to split up those two arguably different
changes in a way that would make sense.

The only case where "$actual" was left behind before was (I think) a
clear case of leaking state unnecessarily outside of --immediate.

>>> Please do not advocate for and encourage newbies who would be
>>> reading the discussion from sidelines to use test_when_finished out
>>> of dogmatic principle without thinking.
>>
>> Is Matheus the newbie here? I think he's contributed enough to form his
>> own opinion.
>
> No, those "reading the discussion from sidelines" are not on To: or
> Cc:, but are eager to learn by reading what is available to git@
> subscribers (including the lore archive).  I do not want those of us
> whose name appear often in the list archive to be sending a wrong
> signal to them.

I'll endeavor to make it clear that my opinion is just my opinion
etc. in the future; I thought it was clear from the context :)

>> In any case, I think it's best to just drop this series.
>
> That is a sad and wrong conclusion.  We should just realize what we
> really wanted to fix in the first place and keep the improvement;
> otherwise all the review time was wasted.

To be clear, that's not some "I can't have the change I want so I'm
taking my toys and going home" tantrum. I don't mind making changes in
response to reviews etc.

As noted above, I'm genuinely still unable to see how I'd selectively
untangle the change I made in a way that neither fixes bugs in the test
nor introduces new ones while I'm at it, and that I'd still be
comfortable putting my name behind.

Hence the suggestion that, since in terms of time investment we're
already past the point of just having come up with two independent
approaches to this, doing so would be a better thing overall.

I do think (and would be willing to work on it) that t/README should
have some summary of best practices for new/updated tests that we should
be striving for. Looking at our "test_when_finished" there are at least
two independent bugs[1] in it, so there's some room for improvement.

1. It creates a ref with echo, but removes it with "update-ref -d",
   which e.g. couldn't delete a corrupt ref (if rev-parse is buggy); and
   more importantly, if rev-parse dies midway through we're left with a
   corrupt repo. The "test_when_finished ... update-ref -d" should come
   *first*.
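
I.e. I'd have expected that example to look more like this (a sketch
with a made-up ref name, creating the ref with update-ref rather than
echo):

	test_expect_success 'scratch ref can be resolved' '
		# register the cleanup *before* creating the ref, so it
		# still runs if anything below dies midway through
		test_when_finished "git update-ref -d refs/heads/scratch" &&
		git update-ref refs/heads/scratch HEAD &&
		git rev-parse --verify refs/heads/scratch
	'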




