On 2022.05.10 16:37, Jeff Hostetler wrote:
>
>
> On 5/5/22 3:58 PM, Josh Steadmon wrote:
> > On 2022.05.03 10:59, Jeff Hostetler wrote:
> > >
> > >
> > > On 4/28/22 5:46 PM, Junio C Hamano wrote:
> > > > Josh Steadmon <steadmon@xxxxxxxxxx> writes:
> > > >
> > > > > In rare cases, wait_or_whine() cannot determine a child process's exit
> > > > > status (and will return -1 in this case). This can cause Git to issue
> > > > > trace2 child_exit events despite the fact that the child is still
> > > > > running.
> > >
> > > I'm curious what is causing the spurious return values.
> > > Could you instrument wait_or_whine() and see which of the
> > > if/else arms are causing the -1 to be returned?
> > >
> > > That routine is rather complicated and looks like it has 3
> > > different ways that a -1 could be returned.
> >
> > Unfortunately we haven't been able to reproduce the issue, so
> > instrumentation won't clarify the cause unless we randomly hit another
> > instance.
> >
> > > >
> > > > Rather, we do not even know if the child is still running when it
> > > > happens, right? It is curious what "rare cases" makes the symptom
> > > > appear. Do we know?
> > > >
> > > > The patch looks OK from the "we do not know the child exited in this
> > > > case, so we shouldn't be reporting the child exit" point of view, of
> > > > course. Having one event that started a child in the log and then
> > > > having millions of events that reports the exit of the (same) child
> > > > is way too broken. With this change, we remove these phoney exit
> > > > events from the log.
> > > >
> > > > Do we know, for such a child process that caused these millions
> > > > phoney exit events, we got a real exit event at the end? Otherwise,
> > > > we'd still have a similar problem in the opposite direction, i.e. a
> > > > child has a start event recorded, many exit event discarded but the
> > > > log lacks the true exit event for the child, implying that the child
> > > > is still running because we failed to log its exit?
> > > >
> > > > >   int finish_command_in_signal(struct child_process *cmd)
> > > > >   {
> > > > >   	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > > > > -	trace2_child_exit(cmd, ret);
> > > > > +	if (ret != -1)
> > > > > +		trace2_child_exit(cmd, ret);
> > > > >   	return ret;
> > > > >   }
> > >
> > > Since this is only called from pager.c and is used to setup the
> > > pager, I have to wonder if you're only getting these spurious events
> > > for the pager process or for any of the other random child processes.
> > >
> > > And whether they are received while the pager is alive and working
> > > properly, or when you're trying to quit the pager or when the pager
> > > is trying to signal eof.
> >
> > It was for a pager run by `git diff`. The only other child process in
> > this instance was `git status`. I believe the events were generated
> > while the user was not interacting with the process; IIRC the pager was
> > left running in a terminal overnight and they discovered the next
> > morning that their disk was filled up by trace logs.
> >
>
> This worries me that you've found a race or infinite loop here
> and the only symptom is the child exit messages. Suppressing
> them solves the disk-fill problem, but might mean that there's
> a lurking infinite loop or something.
>
> Can you (privately) send me the part of the log before the
> infinite loop? And maybe the parent and child processes too?
> I'd kinda like to understand what's happening here.
>
> Jeff

Unfortunately we no longer have the relevant logs. This incident happened
in 2020 and the logs have expired. I can share some general details
though:

The session started on a Friday when the user ran `git diff`. Presumably
the user left the pager open after they left for the weekend. On the
following Saturday we started getting ~30M child_exit events per day
until their disk filled up early Monday morning.

The process was running on a VM, and I have reason to believe that the
user's SSH connection to the VM was broken at some point on the Saturday
that the spammy events started appearing. I don't know whether or not the
user was using any sort of session manager / terminal multiplexer such as
screen or tmux. I am curious whether or not the broken SSH connection
could have created some kind of signal loop that caused the pager to
continuously trigger our wait_for_pager_signal() handler.

I don't have any useful info beyond that, sorry. Thanks for taking a look
though!
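
For illustration, here is a minimal, self-contained sketch (not the
actual wait_or_whine() or finish_command_in_signal() code from
run-command.c) of how a waitpid()-based helper can end up reporting -1,
and how a caller-side guard like the one in the patch quoted above keeps
such a result out of the exit-event log. The helper names
wait_child_status() and report_child_exit() are made up for this example,
and the -1 paths shown are only loosely analogous to the ones in Git
itself.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Simplified stand-in for a wait_or_whine()-style helper. */
    static int wait_child_status(pid_t pid)
    {
            int status;
            pid_t waiting = waitpid(pid, &status, 0);

            if (waiting < 0)
                    return -1;  /* waitpid() itself failed (EINTR, ECHILD, ...) */
            if (waiting != pid)
                    return -1;  /* reaped some other process */
            if (!WIFEXITED(status))
                    return -1;  /* no normal exit status (e.g. killed by a signal) */
            return WEXITSTATUS(status);  /* real exit code, 0..255 */
    }

    /* Mirrors the guard in the patch: only log an exit for a real status. */
    static void report_child_exit(pid_t pid)
    {
            int ret = wait_child_status(pid);
            if (ret != -1)
                    fprintf(stderr, "child %d exited with %d\n", (int)pid, ret);
    }

    int main(void)
    {
            pid_t pid = fork();
            if (pid < 0)
                    return 1;            /* fork failed */
            if (pid == 0)
                    _exit(7);            /* child: exit with a known code */
            report_child_exit(pid);      /* parent: prints "child ... exited with 7" */
            return 0;
    }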