Re: [PATCH 3/3] ci: stop installing "gcc-13" for osx-gcc

On Thu, May 16, 2024 at 02:36:19PM +0200, Patrick Steinhardt wrote:

> I was spending (or rather wasting?) some more time on this. With the
> below diff I was able to get a list of processes running after ~50
> minutes:

I was going to say "good, now I don't have to waste time on it". But
your findings only nerd-sniped me into digging more. ;)

> So it seems like the issue is t9211, and the hang happens in "scalar
> clone warns when background maintenance fails" specifically. What
> exactly the root cause is I have no clue though. Maybe an fsmonitor
> race, maybe something else entirely. Hard to say as I have never seen
> this happen on any other platform than macOS, and I do not have access
> to a Mac myself.
> 
> The issue also doesn't seem to occur when running t9211 on its own, but
> only when running the full test suite. This may further indicate that
> there is a race condition, where the additional load improves the
> likelihood of it. Or there is bad interaction with another test.

I can reproduce it at will and relatively quickly using "--stress" with
t9211. I pushed up a hacky commit that removes all CI jobs except for
osx-clang, stops short of running the build/tests, and instead opens a
shell using tmate. For reference (though you'd need to work out
something similar for GitLab):

  https://github.com/peff/git/commit/f825fa36ed95bed414b0d6d9e8b21799e2e167e4
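
The gist of the tmate trick, in case you want to rig up something
similar yourself, is roughly this (a sketch assuming a brew-installed
tmate; the actual commit may differ in the details):

  # replace the CI job's usual build/test step with:
  brew install tmate
  tmate -F   # runs in the foreground and prints an ssh address to connect to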

And then just:

  make -j8
  cd t
  ./t9211-scalar-clone.sh --stress

Give it a minute or two, and you'll see that most of the jobs have hung,
with one or two "winners" continuing (once most of them are hanging, the
load is low enough that the race doesn't happen). So you'll see 3.17,
3.18, 3.19, and so on, indicating that job 3 is still going through its
17th, 18th, and 19th runs. But everything else is stuck and stops
producing output.

You can likewise see processes in "ps" that are a few minutes old, which
is another way to find the stuck ones. And I get the same three
processes as you: scalar clone, fetch, and fsmonitor--daemon.
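
One way to spot them is something like this (a sketch; these ps flags
work on macOS, at least):

  # show elapsed time per process; anything minutes old is suspect
  ps -axo etime,pid,command | grep -E 'scalar|fetch|fsmonitor' | grep -v grep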

And here's where I ran into tooling issues.

Normally I'd use "strace -p" to see what the hung processes are doing,
but we don't have that on macOS. Running "sudo dtruss -p" works without
complaint, but it doesn't seem to report the current syscall (where
we're presumably blocked).
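
For the record, that attempt was just (with $pid standing in for one of
the stuck processes):

  # attaches without complaint, but the in-progress syscall never shows up
  sudo dtruss -p $pid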

I installed gdb, which does seem to work, but attaching to the running
processes doesn't show a useful backtrace (even after making sure to
build with "-g -O0", and confirming that regular "gdb ./git" works OK).
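
That is, roughly (again with a placeholder pid):

  gdb -p $pid   # attaching works...
  (gdb) bt      # ...but the backtrace that comes back is not useful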

One can guess that scalar is in waitpid() waiting for git-fetch. But
what's fetch waiting on? The other side of upload-pack is dead.
According to lsof, it does have a unix socket open to fsmonitor. So
maybe it's trying to read there?
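
That observation came from something like ($fetch_pid being a
placeholder for the stuck fetch process):

  # look for unix-domain sockets among fetch's open files
  lsof -p $fetch_pid | grep -i unix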

Curiously, killing fsmonitor doesn't un-stick fetch, nor does killing
fetch un-stick scalar. So either my guesses above are wrong, or there's
something else weird causing them to hang.
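
In concrete terms, that experiment was along these lines (placeholder
pids again, found via ps as above):

  kill $fsmonitor_pid   # fetch remains stuck afterwards
  kill $fetch_pid       # scalar remains stuck afterwards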

I imagine there are better tools to poke at things, but I'm at the
limits of my macOS knowledge. Maybe the recipe above is enough for
somebody more clueful to recreate and investigate the situation (it
would probably also be easy to just run the --stress script locally if
somebody actually has a Mac).

-Peff



