On Sat, May 11, 2024 at 09:17:41AM +0200, Patrick Steinhardt wrote:
> On Fri, May 10, 2024 at 04:13:48PM -0400, Jeff King wrote:
> > On Fri, May 10, 2024 at 09:00:04AM +0200, Patrick Steinhardt wrote:
> > 
> > > On Thu, May 09, 2024 at 12:25:44PM -0400, Jeff King wrote:
> > > [snip]
> > > > I'd like to report that this let me get a successful CI run, but I'm
> > > > running into the thing where osx jobs seem to randomly hang sometimes
> > > > and hit the 6-hour timeout. But I did confirm that this lets us get to
> > > > the actual build/test, and not barf while installing dependencies.
> > > 
> > > Yeah, this one is puzzling to me. We see the same thing on GitLab CI,
> > > and until now I haven't yet figured out why that is.
> > 
> > Drat. I was hoping maybe it was a problem in GitHub CI and somebody else
> > would eventually fix it. ;)
> > 
> > It feels like a deadlock somewhere, though whether it is in our code, or
> > in our tests, or some system-ish issue with prove, perl, etc, I don't
> > know. It would be nice to catch it in the act and see what the process
> > tree looks like. I guess poking around in the test environment with
> > tmate might work, though I don't know if there's a way to get tmate
> > running simultaneously with the hung step (so you'd probably have to
> > connect, kick off the "make test" manually and hope it hangs).
> 
> My hunch tells me that it's the Perforce tests -- after all, this is
> where the jobs get stuck, too. In "lib-git-p4.sh" we already document
> that p4d is known to crash at times, and overall the logic to spawn the
> server is quite convoluted.
> 
> I did try to get more useful logs yesterday. But as usual, once you
> _want_ to reproduce a failure like this it doesn't happen anymore.

I spent (or rather wasted?) some more time on this. With the below diff
I was able to get a list of processes running after ~50 minutes:

diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
index 98dda42045..d5570b59d3 100755
--- a/ci/run-build-and-tests.sh
+++ b/ci/run-build-and-tests.sh
@@ -51,8 +51,15 @@ esac
 group Build make
 if test -n "$run_tests"
 then
-	group "Run tests" make test ||
+	(
+		sleep 3200 &&
+		mkdir -p t/failed-test-artifacts &&
+		ps -A >t/failed-test-artifacts/ps 2>&1
+	) &
+	pid=$!
+	group "Run tests" gtimeout 1h make test ||
 	handle_failed_tests
+	kill "$pid"
 fi
 
 check_unignored_build_artifacts

I trimmed that process list down to the following set of relevant
processes:

  PID TTY           TIME CMD
 5196 ??         0:00.01 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5242 ??         0:00.00 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 5244 ??         0:00.00 tee -a /Volumes/RAMDisk/test-results/t9211-scalar-clone.out
 5245 ??         0:00.09 /bin/sh t9211-scalar-clone.sh --verbose-log -x
 7235 ??         0:00.02 /Users/gitlab/builds/gitlab-org/git/scalar clone file:///Volumes/RAMDisk/trash directory.t9211-scalar-clone/to-clone maint-fail
 7265 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fetch --quiet --no-progress origin
 7276 ??         0:00.01 /Users/gitlab/builds/gitlab-org/git/git fsmonitor--daemon run --detach --ipc-threads=8

So it seems like the issue is t9211, and the hang happens in the test
"scalar clone warns when background maintenance fails" specifically.
I have no clue what exactly the root cause is, though. Maybe it's an
fsmonitor race, maybe something else entirely. Hard to say, as I have
never seen this happen on any platform other than macOS, and I do not
have access to a Mac myself.
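If somebody with access to a Mac wants to dig deeper, the watchdog
above could probably be extended to also record parent/child
relationships and stack samples of the stuck processes, which would
get us closer to the "catch it in the act" process tree mentioned
above. The following is only an untested sketch and assumes that
macOS's pgrep(1) and sample(1) are available on the runner:

(
	sleep 3200 &&
	mkdir -p t/failed-test-artifacts &&
	# pid/ppid pairs show which test script spawned which process.
	ps -A -o pid,ppid,state,command >t/failed-test-artifacts/ps 2>&1 &&
	for pid in $(pgrep -f 'scalar clone|fsmonitor--daemon|git fetch')
	do
		# sample(1) is macOS-specific; it records user-space stack
		# traces of the given process for a few seconds.
		sample "$pid" 5 -file "t/failed-test-artifacts/sample-$pid.txt" || :
	done
) &

That would hopefully at least tell us which of scalar, git-fetch and
the fsmonitor daemon is the one that is stuck, and roughly where.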
The issue also doesn't seem to occur when running t9211 on its own, but
only when running the full test suite. This may further indicate that
there is a race condition where the additional load increases the
likelihood of hitting it, or that there is a bad interaction with
another test.
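For what it's worth, the --stress mode of t/test-lib.sh runs a single
test script repeatedly in several parallel jobs until one of them
fails, which might be a cheaper way to generate that kind of load than
a full test suite run. A sketch of such an invocation, assuming the
--stress and --stress-jobs options described in t/README:

# Run t9211 over and over in 8 parallel jobs (the count is arbitrary)
# until one of the jobs fails; t/README documents which logs and trash
# directories of the failing job are kept around for inspection.
cd t &&
./t9211-scalar-clone.sh --stress --stress-jobs=8

Patrick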