On Mon, Aug 16, 2021 at 02:54:14PM -0400, Randall S. Becker wrote:

> >That 60 seconds is the timeout from t5562/invoke-with-content-length.
> >
> >So one, are you sure it's hanging forever, and not just for 60 seconds?
>
> Absolutely sure. 48 hours because I forgot to check.
>
> >And two, it is quite obvious there's some racing here. I'm not sure if this is indicative of a problem in the test suite, or in http-backend
> >itself (in which case it could be affecting real users).
>
> How can I help track this down?

Here's what I found out so far.

For my 60-second lag case, the test _does_ complete as expected; it just
takes a long time. So I think what happens is this:

  - the invoke-with-content-length script sets up a SIGCLD handler

  - then it kicks off http-backend and writes to it

  - then it sleeps for 60 seconds, assuming that SIGCLD will interrupt
    the sleep

  - after the sleep finishes (whether by 60 seconds or because it was
    interrupted by the signal), we check a flag to see if our SIGCLD
    handler was called. If not, then we complain.

This usually completes instantaneously-ish, because the signal
interrupts our sleep. But very occasionally the child process dies
_before_ we hit the sleep, so we don't realize it.

So ideally we'd have some way of atomically checking our flag and then
sleeping only if it's not set. But I don't think that exists. The
closest we can come is using a series of smaller sleeps and checks.

And indeed, digging in the archive shows that Max already proposed such
a patch:

  https://lore.kernel.org/git/20190218205028.32486-1-max@xxxxxxxxxx/

It looks like it fell through the cracks, though. Maybe now is a good
time to resurrect it.

However, you are in that thread, too, and it didn't help your situation.
So I think your race is somehow different. It looks like there was some
weirdness around close() for you, though generally we _shouldn't_ be
hitting that close() at all, because we'd have gotten SIGCLD and set the
$exited flag in the interim.

-Peff
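
For illustration only, the "series of smaller sleeps and checks" idea
could look roughly like the sketch below. It is written in Python, not
the language of the real script, and wait_for_flag, is_set, timeout,
and step are hypothetical names; none of this is taken from Max's patch
or from the git tree.

```python
import time


def wait_for_flag(is_set, timeout=60.0, step=0.1):
    """Wait until is_set() returns true or timeout seconds have passed.

    Instead of one long sleep that relies on a signal to cut it short,
    sleep in short slices and re-check the flag before each slice. If the
    flag was already set before the first sleep (the race described
    above), we return immediately instead of waiting out the full timeout.

    Returns True if the flag was observed set, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while not is_set():
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; worst-case extra latency is one step.
        time.sleep(min(step, remaining))
    return True
```

Here is_set would be something like a closure over the flag that the
signal handler sets. The race does not disappear, but its cost is
bounded by one step rather than the full 60 seconds.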