Re: [PATCH v2] run-command: don't spam trace2_child_exit()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 6/7/22 2:21 PM, Josh Steadmon wrote:
In rare cases[1], wait_or_whine() cannot determine a child process's
status (and will return -1 in this case). This can cause Git to issue
trace2 child_exit events despite the fact that the child may still be
running. In pathological cases, we've seen > 80 million exit events in
our trace logs for a single child process.

Fix this by only issuing trace2 events in finish_command_in_signal() if
we get a value other than -1 from wait_or_whine(). This can lead to
missing child_exit events in such a case, but that is preferable to
duplicating events on a scale that threatens to fill the user's
filesystem with invalid trace logs.

[1]: This can happen when:

* waitpid() returns -1 and errno != EINTR
* waitpid() returns an invalid PID
* the status set by waitpid() has neither the WIFEXITED() nor
   WIFSIGNALED() flags

Signed-off-by: Josh Steadmon <steadmon@xxxxxxxxxx>
---
Updated the commit message with more details about when wait_or_whine()
can fail.

  run-command.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/run-command.c b/run-command.c
index a8501e38ce..e0fe2418a2 100644
--- a/run-command.c
+++ b/run-command.c
@@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd)
  int finish_command_in_signal(struct child_process *cmd)
  {
  	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
-	trace2_child_exit(cmd, ret);
+	if (ret != -1)
+		trace2_child_exit(cmd, ret);
  	return ret;
  }
base-commit: faa21c10d44184f616d391c158dcbb13b9c72ef3



I think there are larger bugs lurking here.

finish_command_in_signal() is only called from
wait_for_pager_signal(), which is loaded up by
sigchain_push_common() when the pager is started.

There are no loops or other callers here.

We sigchain-pop our handler upon first use, so even if we
were getting millions of PIPE signals from other chilld
processes, we shouldn't be getting more than 1 message
here.  (Assuming that our pop is working and our call to
raise() doesn't cause us to be called again.)

The code is behaving as if the "s->n--" at the bottom of
sigchain_pop() isn't being performed and our raise() is
calling us again.

I originally thought it might be due to unbalanced push/pops -- which
could happen if the process were in a "push; do something; pop"
sequence and the PIPE happened in the middle of that.  But I don't
think that matters.

So I have to wonder if the sigchain code has a bug.

The Linux man-page states that signal(2) is not thread-safe
and even if it were, the sigchain code here does not have any
locking on those global data structures.

But that might not be the problem here and it's hard to say without
any of the original trace data.

Having said all of that I think it would be better to fix the
original problem something like this (not tested):

 static void wait_for_pager_signal(int signo)
 {
+	static int handled = 0;
+
+	if (handled)
+		return;
+	handled = 1;
+
	close_pager_fds();
	finish_command_in_signal(&pager_process);
	sigchain_pop(signo);
	raise(signo);
 }

Jeff




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux