On 2022.05.03 10:59, Jeff Hostetler wrote: > > > On 4/28/22 5:46 PM, Junio C Hamano wrote: > > Josh Steadmon <steadmon@xxxxxxxxxx> writes: > > > > > In rare cases, wait_or_whine() cannot determine a child process's exit > > > status (and will return -1 in this case). This can cause Git to issue > > > trace2 child_exit events despite the fact that the child is still > > > running. > > I'm curious what is causing the spurious return values. > Could you instrument wait_or_whine() and see which of the > if/else arms are causing the -1 to be returned? > > That routine is rather complicated and looks like it has 3 > different ways that a -1 could be returned. Unfortunately we haven't been able to reproduce the issue, so instrumentation won't clarify the cause unless we randomly hit another instance. > > > > Rather, we do not even know if the child is still running when it > > happens, right? It is curious what "rare cases" makes the symptom > > appear. Do we know? > > > > The patch looks OK from the "we do not know the child exited in this > > case, so we shouldn't be reporting the child exit" point of view, of > > course. Having one event that started a child in the log and then > > having millions of events that reports the exit of the (same) child > > is way too broken. With this change, we remove these phoney exit > > events from the log. > > > > Do we know, for such a child process that caused these millions > > phoney exit events, we got a real exit event at the end? Otherwise, > > we'd still have a similar problem in the opposite direction, i.e. a > > child has a start event recorded, many exit event discarded but the > > log lacks the true exit event for the child, implying that the child > > is still running because we failed to log its exit? > > > > > int finish_command_in_signal(struct child_process *cmd) > > > { > > > int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1); > > > - trace2_child_exit(cmd, ret); > > > + if (ret != -1) > > > + trace2_child_exit(cmd, ret); > > > return ret; > > > } > > Since this is only called from pager.c and is used to setup the > pager, I have to wonder if you're only getting these spurious events > for the pager process or for any of the other random child processes. > > And whether they are received while the pager is alive and working > properly, or when you're trying to quit the pager or when the pager > is trying to signal eof. It was for a pager run by `git diff`. The only other child process in this instance was `git status`. I believe the events were generated while the user was not interacting with the process; IIRC the pager was left running in a terminal overnight and they discovered the next morning that their disk was filled up by trace logs.