Re: Is t5516 somehow flakey only on macOS?

SZEDER Gábor <szeder.dev@xxxxxxxxx> · Sat, 9 Jan 2021 18:33:36 +0100

On Sat, Jan 09, 2021 at 05:48:10AM -0500, Jeff King wrote:
> On Sat, Jan 09, 2021 at 05:34:05AM -0500, Jeff King wrote:
> 
> > For this _particular_ test, since we know that it is testing a v0-only
> > behavior, we might want to just loosen the test. This goes against the
> > point of adding it in 014ade7484 (upload-pack: send ERR packet for
> > non-tip objects, 2019-04-13), but it's the best we can do for now.
> > Something like this:
> 
> Since this issue has been languishing for a while now with several
> "something like this" patches, I've packaged it up into something more
> palatable. I think we should just apply this and move on. We may still
> run into other similar races, but I don't think this one is worth
> spending more mental effort on.
> 
> -- >8 --
> Subject: [PATCH] t5516: loosen "not our ref" error check
> 
> Commit 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13) added a test that greps the output of a failed fetch to make
> sure that upload-pack sent us the ERR packet we expected. But checking
> this is racy; despite the argument in that commit, the client may still
> be sending a "done" line when the server exits, causing it to die() on a

Nit: I think using the word "after" would make the problematic
sequence of events a tad clearer, i.e. "... after the server has
exited, ...".

> failed write() and never see the ERR packet at all.
> 
> This fails quite rarely on Linux, but more often on macOS. However, it
> can be triggered reliably with:
> 
> 	diff --git a/fetch-pack.c b/fetch-pack.c
> 	index 876f90c759..cf40de9092 100644
> 	--- a/fetch-pack.c
> 	+++ b/fetch-pack.c
> 	@@ -489,6 +489,7 @@ static int find_common(struct fetch_negotiator *negotiator,
> 	 done:
> 	 	trace2_region_leave("fetch-pack", "negotiation_v0_v1", the_repository);
> 	 	if (!got_ready || !no_done) {
> 	+		sleep(1);
> 	 		packet_buf_write(&req_buf, "done\n");
> 	 		send_request(args, fd[1], &req_buf);
> 	 	}

FWIW (not much?), I've run the test suite with that sleep(1) in place,
and there were no other test failures.

> This is a real user-visible race that it would be nice to fix, but it's
> tricky to do so: the client would have to speculatively try to read an
> ERR packet after hitting a write() error. And at least for this error,
> it's specific to v0 (since v2 does not enforce reachability at all).
> 
> So let's loosen the test to avoid annoying racy failures. If we
> eventually do the read-after-failed-write thing, we can tighten it. And
> if not, v0 will grow increasingly obsolete as servers support v2, so the
> utility of this test will decrease over time anyway.

Makes sense.  Back when I investigated this issue the default
protocol was still v0; now that we default to v2 I agree it's better to
work around the issue in the test instead of "fixing" the root cause
with that "trying to read ERR packet on error" hack.
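
For illustration only, the read-after-failed-write idea can be sketched
outside of Git.  This is not fetch-pack code: two plain pipes stand in
for the connection to upload-pack, the function name
read_err_after_failed_write() is made up, and the message is written
raw, without pkt-line framing.  It only shows that after write() fails
with EPIPE, the peer's last message may still be readable:

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical sketch, not Git code.  Returns 1 if the client's
 * write() failed with EPIPE and the server's error message could
 * still be read afterwards, 0 otherwise.  On success the message is
 * copied, NUL-terminated, into buf.
 *
 * Side effect: SIGPIPE is ignored for the whole process.
 */
static int read_err_after_failed_write(char *buf, size_t len)
{
	static const char err[] = "ERR upload-pack: not our ref";
	int to_server[2], from_server[2];
	ssize_t n;
	int ok = 0;

	assert(len > 0);

	/* Get EPIPE from write() instead of being killed by SIGPIPE. */
	signal(SIGPIPE, SIG_IGN);

	if (pipe(to_server) < 0)
		return 0;
	if (pipe(from_server) < 0) {
		close(to_server[0]);
		close(to_server[1]);
		return 0;
	}

	/* "Server": report the error, then exit (close both its ends). */
	n = write(from_server[1], err, strlen(err));
	close(from_server[1]);
	close(to_server[0]);

	/* "Client": still tries to send "done", but the reader is gone. */
	if (n == (ssize_t)strlen(err) &&
	    write(to_server[1], "done\n", 5) < 0 && errno == EPIPE) {
		/* Speculative read: the error is still buffered. */
		n = read(from_server[0], buf, len - 1);
		if (n > 0) {
			buf[n] = '\0';
			ok = 1;
		}
	}

	close(to_server[1]);
	close(from_server[0]);
	return ok;
}
```

Without the speculative read() the client would just report the failed
write, which is the behavior the test trips over.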

Good, a year-and-a-half old entry checked off from my todo list :)
Thanks.