On Fri, May 10, 2024 at 12:41 PM Jeff King <peff@xxxxxxxx> wrote: > > On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote: > > > From: Emily Shaffer <nasamuffin@xxxxxxxxxx> > > > > We already use trace2 to find out about unexpected pipe breakages, which > > is nice for detecting bugs or system problems, by adding a handler for > > SIGPIPE which simply writes a trace2 line. However, there are a handful > > of other common signals which we might want to snoop on: > > > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > > - SIGHUP, when the network closes unexpectedly (indicating there may be > > a problem to solve) > > > > There are lots more signals which we might find useful later, but at > > least let's teach trace2 to report these egregious ones. Conveniently, > > they're also already covered by the `_common` variants in sigchain.[ch]. > > I think this would be a useful thing to have, but having looked at the > trace2 signal code, this is almost certain to cause racy deadlocks. > > The exact details depend on the specific trace2 target backend, but > looking at the various fn_signal() methods, they rely on allocations via > strbufs. This is a problem in signal handlers because we can get a > signal at any time, including when other code is inside malloc() holding > a lock. And then further calls to malloc() will block forever on that > lock. > > We should be able to do a quick experiment. Try this snippet, which > repeatedly kills "git log -p" (which is likely to be allocating memory) > and waits for it to exit. Eventually each invocation will stall on a > deadlock: > > -- >8 -- > doit() { > me=$1 > i=0 > while true; do > GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 & > sleep 0.1 > kill $! > wait $! 2>/dev/null > i=$((i+1)) > echo $me:$i > done > } > > for i in $(seq 1 64); do > doit $i & > done > -- >8 -- > > I didn't have the patience to wait for all of them to stall, but if you > let it run for a bit and check "ps", you'll see some git processes which > are hanging. Stracing shows them stuck on a lock, like: > > $ strace -p 1838693 > strace: Process 1838693 attached > futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached > <detached ...> > > This problem existed before your patch. I imagine it was much less > likely (or perhaps even impossible) with SIGPIPE though, because we'd > see that signal only when in a write() syscall, which implies we're not > in malloc(). Whereas we can get SIGTERM, etc, any time. > > Obviously the script above is meant to exacerbate the situation, and > most runs would be fine. But over the course of normal use across many > users and many runs, I think we would see this in practice. I think your > test won't because it triggers the signal only from raise(). > > So I think before doing this, we'd need to clean up the trace2 signal > code to avoid any allocations. I started to look into doing this, and it's actually really tricky. I've got a sample diff here (https://github.com/git/git/commit/bf8a5084ede2b9c476e0cf90b7f198c52573fba7); I'll need to do it for the other two trace formats as well. But, the entire trace2 library relies heavily on strbuf, which doesn't have a stack-allocated form. I'm also not sure how we can guarantee the no-alloc-ness of these - maybe there's some flag we can give to one of the analyzers or something? - so I'm worried about backsliding in the future. Anyway, I won't have time to work on these again until the end of next week. If this looks like a reasonable direction I'll pick it up again then; otherwise, maybe it makes sense for the fn_signal() dispatcher to just time out if the handler process doesn't terminate in, say, 1s? - Emily > > -Peff