Re: [ANNOUNCE] Git v2.33.0-rc2 (Build/Test Report)

Jeff King <peff@xxxxxxxx> · Mon, 16 Aug 2021 17:55:01 -0400

On Mon, Aug 16, 2021 at 02:54:14PM -0400, Randall S. Becker wrote:

> >That 60 seconds is the timeout from t5562/invoke-with-content-length.
> >
> >So one, are you sure it's hanging forever, and not just for 60 seconds?
> 
> Absolutely sure. 48 hours because I forgot to check.
> 
> >And two, it is quite obvious there's some racing here. I'm not sure
> >if this is indicative of a problem in the test suite, or in
> >http-backend itself (in which case it could be affecting real users).
> 
> How can I help track this down?

Here's what I found out so far. For my 60-second lag case, the test
_does_ complete as expected; it just takes a long time. So I think what
happens is this:

  - the invoke-with-content-length script sets up a SIGCLD handler

  - then it kicks off http-backend and writes to it

  - then it sleeps for 60 seconds, assuming that SIGCLD will interrupt
    the sleep

  - after the sleep finishes (whether by 60 seconds or because it was
    interrupted by the signal), we check a flag to see if our SIGCLD
    handler was called. If not, then we complain.

This usually completes instantaneously-ish, because the signal
interrupts our sleep. But very occasionally the child process dies
_before_ we hit the sleep, so the signal has nothing to interrupt and
we sit through the full 60 seconds before noticing.

So ideally we'd have some way of atomically checking our flag and then
sleeping only if it's not set. But I don't think that exists. The
closest we can come is using a series of smaller sleeps and checks. And
indeed, digging in the archive shows that Max already proposed such a
patch:

  https://lore.kernel.org/git/20190218205028.32486-1-max@xxxxxxxxxx/

It looks like it fell through the cracks, though. Maybe now is a good
time to resurrect it.
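
To make the "series of smaller sleeps and checks" idea concrete, here
is a rough sketch. It is Python for illustration only, not the test
script's code and not Max's patch; the names (exited, on_sigchld,
wait_for_child, demo) and the 0.1-second step are all made up. It
assumes a Unix system with SIGCHLD. The demo deliberately lets the
child die before we start waiting, which is the losing side of the
race described above:

```python
import signal
import subprocess
import sys
import time

exited = False  # hypothetical flag, set by the SIGCHLD handler


def on_sigchld(signum, frame):
    """Record that a child has exited."""
    global exited
    exited = True


def wait_for_child(timeout=60.0, step=0.1):
    """Check the flag between short sleeps instead of one long sleep.

    A signal that lands between the check and the sleep can still be
    missed, but it now costs at most `step` seconds, not `timeout`.
    """
    deadline = time.monotonic() + timeout
    while not exited and time.monotonic() < deadline:
        time.sleep(step)
    return exited


def demo():
    """Let the child exit *before* we start waiting, then wait."""
    global exited
    exited = False
    old_handler = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        child = subprocess.Popen([sys.executable, "-c", "pass"])
        time.sleep(0.5)  # child should be dead before we wait
        start = time.monotonic()
        saw_exit = wait_for_child(timeout=5.0)
        waited = time.monotonic() - start
        child.wait()  # reap the child
        return saw_exit, waited
    finally:
        signal.signal(signal.SIGCHLD, old_handler)
```

The check-then-sleep window is still there, so this is not the atomic
operation we'd really like; it only bounds the damage to one short
sleep.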

However, you are in that thread, too, and it didn't help your situation.
So I think your race is somehow different. It looks like there was some
weirdness around close() for you, though generally we _shouldn't_ be
hitting that close() at all, because we'd have gotten SIGCLD and set the
$exited flag in the interim.

-Peff