On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote: > From: Emily Shaffer <nasamuffin@xxxxxxxxxx> > > We already use trace2 to find out about unexpected pipe breakages, which > is nice for detecting bugs or system problems, by adding a handler for > SIGPIPE which simply writes a trace2 line. However, there are a handful > of other common signals which we might want to snoop on: > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > - SIGHUP, when the network closes unexpectedly (indicating there may be > a problem to solve) > > There are lots more signals which we might find useful later, but at > least let's teach trace2 to report these egregious ones. Conveniently, > they're also already covered by the `_common` variants in sigchain.[ch]. I think this would be a useful thing to have, but having looked at the trace2 signal code, this is almost certain to cause racy deadlocks. The exact details depend on the specific trace2 target backend, but looking at the various fn_signal() methods, they rely on allocations via strbufs. This is a problem in signal handlers because we can get a signal at any time, including when other code is inside malloc() holding a lock. And then further calls to malloc() will block forever on that lock. We should be able to do a quick experiment. Try this snippet, which repeatedly kills "git log -p" (which is likely to be allocating memory) and waits for it to exit. Eventually each invocation will stall on a deadlock: -- >8 -- doit() { me=$1 i=0 while true; do GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 & sleep 0.1 kill $! wait $! 2>/dev/null i=$((i+1)) echo $me:$i done } for i in $(seq 1 64); do doit $i & done -- >8 -- I didn't have the patience to wait for all of them to stall, but if you let it run for a bit and check "ps", you'll see some git processes which are hanging. Stracing shows them stuck on a lock, like: $ strace -p 1838693 strace: Process 1838693 attached futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached <detached ...> This problem existed before your patch. I imagine it was much less likely (or perhaps even impossible) with SIGPIPE though, because we'd see that signal only when in a write() syscall, which implies we're not in malloc(). Whereas we can get SIGTERM, etc, any time. Obviously the script above is meant to exacerbate the situation, and most runs would be fine. But over the course of normal use across many users and many runs, I think we would see this in practice. I think your test won't because it triggers the signal only from raise(). So I think before doing this, we'd need to clean up the trace2 signal code to avoid any allocations. -Peff