Re: [PATCH] t7610: fix flaky timeout issue, don't clone from example.com

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 10 Nov 2022 00:55:30 +0100

On Wed, Nov 09 2022, Taylor Blau wrote:

> On Sat, Nov 05, 2022 at 12:54:21PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> The behavior of "-N" here might be surprising to some, since it's
>> explained as "[if you use -N we] don’t fetch new objects from the
>> remote site". But (perhaps counter-intuitively) it's only talking
>> about if it needs to do so via "git fetch". In this case we'll end up
>> spawning a "git clone", as we have no submodule set up.
>
> Makes sense, though I'm not sure I agree this is worth patching as I
> currently understand it.
>
> If I run t7610 today with '--run=2-' (IOW, skipping the setup test), I
> am definitely going to get failures. And I don't think we should have
> every subsequent test depend on having run anything containing "setup"
> before it. That is, it is not surprising that we will see some test
> failures when carving up and running portions of the test instead of the
> whole file.
>
> I don't think this is substantively any different than that. Whether we
> don't complete the "setup" test because we had some leak (and ran the
> test suite with the appropriate LSan options), or because we neglected
> to run it at all, I don't think there is a significant difference.
>
> Either way, the end-state of the test repository isn't guaranteed to
> match the intent of the "setup" test.
>
> If this is the only such problem in-tree, sure, I think it's fine to
> patch. But I would wager that there are *many* more than just this one
> lurking, and patching all of them would be less straightforward than
> this one.
>
> So... I don't know. I'm certainly leaning negative on this patch, but if
> you have some compelling reason that I'm missing, I'm all-ears.

I agree with that in general, but the expected failure mode for all
those other tests that break when the setup is missing is just that
they'll predictably error out immediately.

Whereas in this case we'll end up hanging on a connect() to some box
that IANA maintains for the example.com landing page, and after we
reach the connect() timeout we move on to the next failing test, which
will do that again. The test takes a long time to finally report an
overall failure.

I think that's worth fixing in general, aside from the leak
detection. I.e. yes, in general we aren't good about running tests 3..9
successfully if the 2nd test fails.

But we generally just fail some or all of 3..9 pretty fast, and don't
start taking 20 minutes to run the test, when it took 10s before (or
whatever).
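
To illustrate the "fail fast" behavior I mean, here's a rough sketch (not
the patch itself). It assumes we point the clone at a local path that
doesn't exist; the names "bogus_remote" and "sketch-dst" are made up for
the example. Since no DNS lookup or connect() is involved, "git clone"
should error out right away rather than wait on a network timeout:

```shell
# Hypothetical stand-in for the remote: a local path that does not exist.
bogus_remote="$PWD/no-such-repo-for-t7610-sketch"

# With a local path there is nothing to connect() to, so "git clone"
# exits non-zero as soon as it finds the source repository is missing.
if git clone "$bogus_remote" sketch-dst >/dev/null 2>&1
then
	clone_result=cloned
else
	clone_result=failed
fi
echo "clone: $clone_result"
```

A test that trips over a URL like this still fails when the setup is
missing, but it does so in roughly the time it takes to spawn "git clone".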