On Tue, Jun 07 2022, Josh Steadmon wrote: > In rare cases[1], wait_or_whine() cannot determine a child process's > status (and will return -1 in this case). This can cause Git to issue > trace2 child_exit events despite the fact that the child may still be > running. In pathological cases, we've seen > 80 million exit events in > our trace logs for a single child process. > > Fix this by only issuing trace2 events in finish_command_in_signal() if > we get a value other than -1 from wait_or_whine(). This can lead to > missing child_exit events in such a case, but that is preferable to > duplicating events on a scale that threatens to fill the user's > filesystem with invalid trace logs. > > [1]: This can happen when: > > * waitpid() returns -1 and errno != EINTR > * waitpid() returns an invalid PID > * the status set by waitpid() has neither the WIFEXITED() nor > WIFSIGNALED() flags > > Signed-off-by: Josh Steadmon <steadmon@xxxxxxxxxx> > --- > Updated the commit message with more details about when wait_or_whine() > can fail. > > run-command.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/run-command.c b/run-command.c > index a8501e38ce..e0fe2418a2 100644 > --- a/run-command.c > +++ b/run-command.c > @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd) > int finish_command_in_signal(struct child_process *cmd) > { > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > - trace2_child_exit(cmd, ret); > + if (ret != -1) > + trace2_child_exit(cmd, ret); > return ret; > } This seems like a legitimate issue, but I really don't think we should sweep this under the rug like this. * Why can't we see if we logged such an event already in common_exit(), if we didn't we should trace2_child_exit() (or similar). I.e. not miss an event, ever. * Should this really be an "exit" event, aren't some of these failed signal events? Per the "should this be an exit event?" question in my related "signal on BUG" series. * We should have tests here, e.g. in t0210 to see the exact events we emit in certain cases, we really should have a test for this. Perhaps we can instrument a simulated failure with some GIT_TEST_* variables?