On 2022.05.10 16:37, Jeff Hostetler wrote:
>
>
> On 5/5/22 3:58 PM, Josh Steadmon wrote:
> > On 2022.05.03 10:59, Jeff Hostetler wrote:
> > >
> > >
> > > On 4/28/22 5:46 PM, Junio C Hamano wrote:
> > > > Josh Steadmon <steadmon@xxxxxxxxxx> writes:
> > > >
> > > > > In rare cases, wait_or_whine() cannot determine a child process's exit
> > > > > status (and will return -1 in this case). This can cause Git to issue
> > > > > trace2 child_exit events despite the fact that the child is still
> > > > > running.
> > >
> > > I'm curious what is causing the spurious return values.
> > > Could you instrument wait_or_whine() and see which of the
> > > if/else arms are causing the -1 to be returned?
> > >
> > > That routine is rather complicated and looks like it has 3
> > > different ways that a -1 could be returned.
> >
> > Unfortunately we haven't been able to reproduce the issue, so
> > instrumentation won't clarify the cause unless we randomly hit another
> > instance.
> >
> > > >
> > > > Rather, we do not even know if the child is still running when it
> > > > happens, right? It is curious what "rare cases" makes the symptom
> > > > appear. Do we know?
> > > >
> > > > The patch looks OK from the "we do not know the child exited in this
> > > > case, so we shouldn't be reporting the child exit" point of view, of
> > > > course. Having one event that started a child in the log and then
> > > > having millions of events that reports the exit of the (same) child
> > > > is way too broken. With this change, we remove these phoney exit
> > > > events from the log.
> > > >
> > > > Do we know, for such a child process that caused these millions
> > > > phoney exit events, we got a real exit event at the end? Otherwise,
> > > > we'd still have a similar problem in the opposite direction, i.e. a
> > > > child has a start event recorded, many exit event discarded but the
> > > > log lacks the true exit event for the child, implying that the child
> > > > is still running because we failed to log its exit?
> > > >
> > > > >   int finish_command_in_signal(struct child_process *cmd)
> > > > >   {
> > > > >   	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> > > > > -	trace2_child_exit(cmd, ret);
> > > > > +	if (ret != -1)
> > > > > +		trace2_child_exit(cmd, ret);
> > > > >   	return ret;
> > > > >   }
> > >
> > > Since this is only called from pager.c and is used to setup the
> > > pager, I have to wonder if you're only getting these spurious events
> > > for the pager process or for any of the other random child processes.
> > >
> > > And whether they are received while the pager is alive and working
> > > properly, or when you're trying to quit the pager or when the pager
> > > is trying to signal eof.
> >
> > It was for a pager run by `git diff`. The only other child process in
> > this instance was `git status`. I believe the events were generated
> > while the user was not interacting with the process; IIRC the pager was
> > left running in a terminal overnight and they discovered the next
> > morning that their disk was filled up by trace logs.
> >
>
> This worries me that you've found a race or infinite loop here
> and the only symptom is the child exit messages. Suppressing
> them solves the disk-fill problem, but might mean that there's
> a lurking infinite loop or something.
>
> Can you (privately) send me the part of the log before the
> infinite loop? And maybe the parent and child processes too?
> I'd kinda like to understand what's happening here.
>
> Jeff

Unfortunately we no longer have the relevant logs. This incident happened
in 2020 and the logs have expired. I can share some general details
though:

The session started on a Friday when the user ran `git diff`. Presumably
the user left the pager open after they left for the weekend. On the
following Saturday we started getting ~30M child_exit events per day
until their disk filled up early Monday morning.

The process was running on a VM, and I have reason to believe that the
user's SSH connection to the VM was broken at some point on the Saturday
that the spammy events started appearing. I don't know whether or not the
user was using any sort of session manager / terminal multiplexer such as
screen or tmux. I am curious whether or not the broken SSH connection
could have created some kind of signal loop that caused the pager to
continuously trigger our wait_for_pager_signal() handler.

I don't have any useful info beyond that, sorry. Thanks for taking a look
though!
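
For illustration, here is a minimal, self-contained sketch (not the
actual wait_or_whine() or finish_command_in_signal() code from
run-command.c) of how a waitpid()-based helper can end up reporting -1,
and how a caller-side guard like the one in the patch quoted above keeps
such a result out of the exit-event log. The helper names
wait_child_status() and report_child_exit() are made up for this example,
and the -1 paths shown are only loosely analogous to the ones in Git
itself.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Simplified stand-in for a wait_or_whine()-style helper. */
    static int wait_child_status(pid_t pid)
    {
            int status;
            pid_t waiting = waitpid(pid, &status, 0);

            if (waiting < 0)
                    return -1;  /* waitpid() itself failed (EINTR, ECHILD, ...) */
            if (waiting != pid)
                    return -1;  /* reaped some other process */
            if (!WIFEXITED(status))
                    return -1;  /* no normal exit status (e.g. killed by a signal) */
            return WEXITSTATUS(status);  /* real exit code, 0..255 */
    }

    /* Mirrors the guard in the patch: only log an exit for a real status. */
    static void report_child_exit(pid_t pid)
    {
            int ret = wait_child_status(pid);
            if (ret != -1)
                    fprintf(stderr, "child %d exited with %d\n", (int)pid, ret);
    }

    int main(void)
    {
            pid_t pid = fork();
            if (pid < 0)
                    return 1;            /* fork failed */
            if (pid == 0)
                    _exit(7);            /* child: exit with a known code */
            report_child_exit(pid);      /* parent: prints "child ... exited with 7" */
            return 0;
    }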