Re: [PATCH] run-command: don't spam trace2_child_exit()

Josh Steadmon <steadmon@xxxxxxxxxx> · Thu, 5 May 2022 12:44:23 -0700

On 2022.04.28 14:46, Junio C Hamano wrote:
> Josh Steadmon <steadmon@xxxxxxxxxx> writes:
> 
> > In rare cases, wait_or_whine() cannot determine a child process's exit
> > status (and will return -1 in this case). This can cause Git to issue
> > trace2 child_exit events despite the fact that the child is still
> > running.
> 
> Rather, we do not even know if the child is still running when it
> happens, right?

Correct, if you'd like me to clarify the commit message I'll send a V2.

> It is curious what "rare cases" makes the symptom
> appear.  Do we know?

Unfortunately, no. The quoted 80 million exit event instance was not
reproducible.

> The patch looks OK from the "we do not know the child exited in this
> case, so we shouldn't be reporting the child exit" point of view, of
> course.  Having one event that started a child in the log and then
> having millions of events that reports the exit of the (same) child
> is way too broken.  With this change, we remove these phoney exit
> events from the log.
> 
> Do we know, for such a child process that caused these millions
> phoney exit events, we got a real exit event at the end?

We don't know. The trace log filled up the user's disk in this case, so
the log was truncated.

> Otherwise,
> we'd still have a similar problem in the opposite direction, i.e. a
> child has a start event recorded, many exit event discarded but the
> log lacks the true exit event for the child, implying that the child
> is still running because we failed to log its exit?

Yes, that is a weakness with this approach.

> >  int finish_command_in_signal(struct child_process *cmd)
> >  {
> >  	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > -	trace2_child_exit(cmd, ret);
> > +	if (ret != -1)
> > +		trace2_child_exit(cmd, ret);
> >  	return ret;
> >  }
> 
> Will queue; thanks.

Thanks, and sorry for the delayed reply, I've been out sick for a few
days.