Re: [PATCH] trace2: intercept all common signals

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote:

> From: Emily Shaffer <nasamuffin@xxxxxxxxxx>
> 
> We already use trace2 to find out about unexpected pipe breakages, which
> is nice for detecting bugs or system problems, by adding a handler for
> SIGPIPE which simply writes a trace2 line. However, there are a handful
> of other common signals which we might want to snoop on:
> 
>  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
>    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
>  - SIGHUP, when the network closes unexpectedly (indicating there may be
>    a problem to solve)
> 
> There are lots more signals which we might find useful later, but at
> least let's teach trace2 to report these egregious ones. Conveniently,
> they're also already covered by the `_common` variants in sigchain.[ch].

I think this would be a useful thing to have, but having looked at the
trace2 signal code, this is almost certain to cause racy deadlocks.

The exact details depend on the specific trace2 target backend, but
looking at the various fn_signal() methods, they rely on allocations via
strbufs. This is a problem in signal handlers because we can get a
signal at any time, including when other code is inside malloc() holding
a lock. And then further calls to malloc() will block forever on that
lock.

We should be able to do a quick experiment. Try this snippet, which
repeatedly kills "git log -p" (which is likely to be allocating memory)
and waits for it to exit. Eventually each invocation will stall on a
deadlock:

-- >8 --
doit() {
	me=$1
	i=0
	while true; do
		GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 &
		sleep 0.1
		kill $!
		wait $! 2>/dev/null
		i=$((i+1))
		echo $me:$i
	done
}

for i in $(seq 1 64); do
	doit $i &
done
-- >8 --

I didn't have the patience to wait for all of them to stall, but if you
let it run for a bit and check "ps", you'll see some git processes which
are hanging. Stracing shows them stuck on a lock, like:

  $ strace -p 1838693
  strace: Process 1838693 attached
  futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached
   <detached ...>

This problem existed before your patch. I imagine it was much less
likely (or perhaps even impossible) with SIGPIPE though, because we'd
see that signal only when in a write() syscall, which implies we're not
in malloc(). Whereas we can get SIGTERM, etc, any time.

Obviously the script above is meant to exacerbate the situation, and
most runs would be fine. But over the course of normal use across many
users and many runs, I think we would see this in practice. I think your
test won't because it triggers the signal only from raise().

So I think before doing this, we'd need to clean up the trace2 signal
code to avoid any allocations.

-Peff




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux