Re: crash sometimes doesn't terminate, loops forever looking for a process that doesn't exist

Dave Anderson <anderson@xxxxxxxxxx> · Mon, 07 Nov 2011 09:24:53 -0500 (EST)

----- Original Message -----
> Dear crash-utility,
> 
> In our vmcore analysis infrastructure we stumbled on a case where crash
> doesn't terminate. When examining the state of the process with gdb
> it seems to be looping forever over /proc/$pid/stat in an attempt to determine
> the PID of a process that doesn't exist any more.
> 
> The backtrace:
> (gdb) bt
> #0  0x00007fb9814fea57 in munmap () from /lib/libc.so.6
> #1  0x00007fb9814a30aa in _IO_setb () from /lib/libc.so.6
> #2  0x00007fb9814a1d18 in _IO_file_close_it () from /lib/libc.so.6
> #3  0x00007fb981495a48 in fclose () from /lib/libc.so.6
> #4  0x00000000004fe75b in output_command_to_pids () at cmdline.c:775
> #5  0x00000000004fed7c in setup_redirect (origin=1) at cmdline.c:519
> #6  0x00000000005012bb in process_command_line () at cmdline.c:149
> #7  0x000000000045f575 in main_loop () at main.c:610
> #8  0x0000000000541af9 in captured_command_loop (data=0x7fb982282000) at ./main.c:226
> #9  0x000000000053fd8b in catch_errors (func=<value optimized out>, func_args=<value optimized out>, errstring=<value optimized out>, mask=<value optimized out>) at exceptions.c:520
> #10 0x00000000005415b6 in captured_main (data=<value optimized out>) at ./main.c:924
> #11 0x000000000053fd8b in catch_errors (func=<value optimized out>,
> func_args=<value optimized out>, errstring=<value optimized out>, mask=<value optimized out>) at exceptions.c:520
> #12 0x0000000000540994 in gdb_main (args=0x1000) at ./main.c:939
> #13 0x00000000005409ce in gdb_main_entry (argc=<value optimized out>, argv=0x1000) at ./main.c:959
> #14 0x000000000046025a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:525
> 
> The problematic code:
>  720 /*
>  721  *  Determine the pids of the current popen'd shell and output command.
>  722  *  This is all done using /proc; the ps kludge at the bottom of 
>  this
>  723  *  routine is legacy, and should only get executed if /proc doesn't exist.
>  724  */
>  725 static int
>  726 output_command_to_pids(void)
>  727 {
>  ...
>  738         int retries;
>  739
>  740         retries = 0;
>  741         pc->pipe_pid = pc->pipe_shell_pid = 0;
>  742         sprintf(lookfor, "(%s)", pc->pipe_command);
>  743         stall(1000);
>  744 retry:
>  745         if (is_directory("/proc") && (dirp = opendir("/proc")))
>  {
>  746                 for (dp = readdir(dirp); dp && !pc->pipe_pid;
>  747                      dp = readdir(dirp)) {
>  748                         if (!decimal(dp->d_name, 0))
>  749                                 continue;
>  750                         sprintf(buf1, "/proc/%s/stat",
>  dp->d_name);
>  751                         if (file_exists(buf1, NULL) &&
>  752                             (stp = fopen(buf1, "r"))) {
>  753                                 if (fgets(buf2, BUFSIZE, stp)) {
>  754                                         pid = strtok(buf2, " ");
>  755                                         name = strtok(NULL, "
>  ");
>  756                                         status = strtok(NULL, "
>  ");
>  757                                         p_pid = strtok(NULL, "
>  ");
>  758                                         pgrp = strtok(NULL, "
>  ");
>  759                                         if (STREQ(name, "(sh)")
>  &&
>  760                                             (atoi(p_pid) ==  getpid()))
>  761
> pc->pipe_shell_pid = atoi(pid);
>  762                                         if (STREQ(name, lookfor)
>  &&
>  763                                             ((atoi(p_pid) ==
>  getpid()) ||
>  764                                              (atoi(p_pid) ==
> pc->pipe_shell_pid)
>  765                                              || (atoi(pgrp) ==
> getpid()))) {
>  766                                                 pc->pipe_pid =
>  atoi(pid);
>  767                                                 console(
>  768                             "FOUND[%d] (%d->%d->%d) %s %s p_pid:
> %s pgrp: %s\n",
>  769                                                     retries,
>  getpid(),
>  770
>                                                      pc->pipe_shell_pid,
>  771
>                                                      pc->pipe_pid,
>  772                                                     name,
>  status,
> p_pid, pgrp);
>  773                                         }
>  774                                 }
>  775                                 fclose(stp);
>  776                         }
>  777                 }
>  778                 closedir(dirp);
>  779         }
>  780
>  781         if (!pc->pipe_pid && ((retries++ < 10) || pc->pipe_shell_pid)) {
>  782                 stall(1000);
>  783                 goto retry;
>  784         }
> 
> Looking at how many times it has been looping over /proc:
> (gdb) p retries
> $19 = 138056108
> 
> It found the PID of the shell but not of the command:
> (gdb) p pc->pipe_shell_pid
> $20 = 9306
> (gdb) p pc->pipe_pid
> $21 = 0
> 
> For completeness the command that was being run was looking like this:
> (gdb) p pc->orig_line
> $26 = "log | grep -A1 'some string' >> /some/file", '\000' [...]
> 
> So it seems something like this happened:
> +>popen(grep)
> +--> fork(); execve(sh)
> +---> fork(); execve(grep)
> +----> grep exit()s for some reason
> +>crash(8) finds sh in /proc
> +---> sh exit
> +>crash(8) keeps looking for grep in /proc
> 
> I have a second core showing a similar situation if that's of any use but
> now we just work around the problem by wrapping crash(8) within timeout(1).
> 
> We could try and fix that function to bail out when the shell exits
> but it really doesn't look like a nice way to do it to me. So I looked
> at the reasons we want the PID of that command and it seems there are
> two:
> 
> * determining whether the process is still alive
> This can be done by checking whether the intervening shell is still alive.
> Obtaining only the PID of the shell seems less problematic than trying to
> get the PID of the grandchildren. At worst reimplementing popen()
> to store the PID of sh is not exactly hard.
> 
> * terminating the process forcibly (SIGKILL)
> This is done in close_output() which is only called from within restart() when
> its argument is not SIGSEGV, SIGPIPE, SIGINT or 0. I cannot find that function
> being set as a signal handler for anything else or being called with an
> argument different from 0. As far as I can tell this is dead code.
> 
> Before I write/test/submit a patch, am I missing something or would it make
> sense to get rid of that pipe_pid?
> 
> Thanks,
> Adrien

Hi Adrien,

I would be a little hesitant to get rid of the pc->pipe_pid at this point
in time.  

I can't seem to be able to reproduce it, but certainly there should 
be an escape valve in output_commands_to_pid() to recognize it and bail
out.  But I presume that your piped command sequence actually worked, 
and so it would be strange/unnecessary for setup_redirect() to do the 
error(FATAL_RESTART, ...) that it currently does when output_commands_to_pid()
returns with a NULL?

Anyway, my point is to try to keep the fix as simple as possible...  

Thanks,
  Dave

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility