----- Original Message ----- > Dear crash-utility, > > In our vmcore analysis infrastructure we stumbled on a case where crash > doesn't terminate. When examining the state of the process with gdb > it seems to be looping forever over /proc/$pid/stat in an attempt to determine > the PID of a process that doesn't exist any more. > > The backtrace: > (gdb) bt > #0 0x00007fb9814fea57 in munmap () from /lib/libc.so.6 > #1 0x00007fb9814a30aa in _IO_setb () from /lib/libc.so.6 > #2 0x00007fb9814a1d18 in _IO_file_close_it () from /lib/libc.so.6 > #3 0x00007fb981495a48 in fclose () from /lib/libc.so.6 > #4 0x00000000004fe75b in output_command_to_pids () at cmdline.c:775 > #5 0x00000000004fed7c in setup_redirect (origin=1) at cmdline.c:519 > #6 0x00000000005012bb in process_command_line () at cmdline.c:149 > #7 0x000000000045f575 in main_loop () at main.c:610 > #8 0x0000000000541af9 in captured_command_loop (data=0x7fb982282000) at ./main.c:226 > #9 0x000000000053fd8b in catch_errors (func=<value optimized out>, func_args=<value optimized out>, errstring=<value optimized out>, mask=<value optimized out>) at exceptions.c:520 > #10 0x00000000005415b6 in captured_main (data=<value optimized out>) at ./main.c:924 > #11 0x000000000053fd8b in catch_errors (func=<value optimized out>, > func_args=<value optimized out>, errstring=<value optimized out>, mask=<value optimized out>) at exceptions.c:520 > #12 0x0000000000540994 in gdb_main (args=0x1000) at ./main.c:939 > #13 0x00000000005409ce in gdb_main_entry (argc=<value optimized out>, argv=0x1000) at ./main.c:959 > #14 0x000000000046025a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:525 > > The problematic code: > 720 /* > 721 * Determine the pids of the current popen'd shell and output command. > 722 * This is all done using /proc; the ps kludge at the bottom of > this > 723 * routine is legacy, and should only get executed if /proc doesn't exist. > 724 */ > 725 static int > 726 output_command_to_pids(void) > 727 { > ... > 738 int retries; > 739 > 740 retries = 0; > 741 pc->pipe_pid = pc->pipe_shell_pid = 0; > 742 sprintf(lookfor, "(%s)", pc->pipe_command); > 743 stall(1000); > 744 retry: > 745 if (is_directory("/proc") && (dirp = opendir("/proc"))) > { > 746 for (dp = readdir(dirp); dp && !pc->pipe_pid; > 747 dp = readdir(dirp)) { > 748 if (!decimal(dp->d_name, 0)) > 749 continue; > 750 sprintf(buf1, "/proc/%s/stat", > dp->d_name); > 751 if (file_exists(buf1, NULL) && > 752 (stp = fopen(buf1, "r"))) { > 753 if (fgets(buf2, BUFSIZE, stp)) { > 754 pid = strtok(buf2, " "); > 755 name = strtok(NULL, " > "); > 756 status = strtok(NULL, " > "); > 757 p_pid = strtok(NULL, " > "); > 758 pgrp = strtok(NULL, " > "); > 759 if (STREQ(name, "(sh)") > && > 760 (atoi(p_pid) == getpid())) > 761 > pc->pipe_shell_pid = atoi(pid); > 762 if (STREQ(name, lookfor) > && > 763 ((atoi(p_pid) == > getpid()) || > 764 (atoi(p_pid) == > pc->pipe_shell_pid) > 765 || (atoi(pgrp) == > getpid()))) { > 766 pc->pipe_pid = > atoi(pid); > 767 console( > 768 "FOUND[%d] (%d->%d->%d) %s %s p_pid: > %s pgrp: %s\n", > 769 retries, > getpid(), > 770 > pc->pipe_shell_pid, > 771 > pc->pipe_pid, > 772 name, > status, > p_pid, pgrp); > 773 } > 774 } > 775 fclose(stp); > 776 } > 777 } > 778 closedir(dirp); > 779 } > 780 > 781 if (!pc->pipe_pid && ((retries++ < 10) || pc->pipe_shell_pid)) { > 782 stall(1000); > 783 goto retry; > 784 } > > Looking at how many times it has been looping over /proc: > (gdb) p retries > $19 = 138056108 > > It found the PID of the shell but not of the command: > (gdb) p pc->pipe_shell_pid > $20 = 9306 > (gdb) p pc->pipe_pid > $21 = 0 > > For completeness the command that was being run was looking like this: > (gdb) p pc->orig_line > $26 = "log | grep -A1 'some string' >> /some/file", '\000' [...] > > So it seems something like this happened: > +>popen(grep) > +--> fork(); execve(sh) > +---> fork(); execve(grep) > +----> grep exit()s for some reason > +>crash(8) finds sh in /proc > +---> sh exit > +>crash(8) keeps looking for grep in /proc > > I have a second core showing a similar situation if that's of any use but > now we just work around the problem by wrapping crash(8) within timeout(1). > > We could try and fix that function to bail out when the shell exits > but it really doesn't look like a nice way to do it to me. So I looked > at the reasons we want the PID of that command and it seems there are > two: > > * determining whether the process is still alive > This can be done by checking whether the intervening shell is still alive. > Obtaining only the PID of the shell seems less problematic than trying to > get the PID of the grandchildren. At worst reimplementing popen() > to store the PID of sh is not exactly hard. > > * terminating the process forcibly (SIGKILL) > This is done in close_output() which is only called from within restart() when > its argument is not SIGSEGV, SIGPIPE, SIGINT or 0. I cannot find that function > being set as a signal handler for anything else or being called with an > argument different from 0. As far as I can tell this is dead code. > > Before I write/test/submit a patch, am I missing something or would it make > sense to get rid of that pipe_pid? > > Thanks, > Adrien Hi Adrien, I would be a little hesitant to get rid of the pc->pipe_pid at this point in time. I can't seem to be able to reproduce it, but certainly there should be an escape valve in output_commands_to_pid() to recognize it and bail out. But I presume that your piped command sequence actually worked, and so it would be strange/unnecessary for setup_redirect() to do the error(FATAL_RESTART, ...) that it currently does when output_commands_to_pid() returns with a NULL? Anyway, my point is to try to keep the fix as simple as possible... Thanks, Dave -- Crash-utility mailing list Crash-utility@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/crash-utility