On Tue, 2014-05-13 at 10:00 -0400, Dave Jones wrote:
> On Tue, May 13, 2014 at 04:43:48PM +1000, Michael Ellerman wrote:
> >
> > I'm consistently ending up with a watchdog that is spinning using 100% cpu.
> >
> > We are bailing out of __check_main() before clearing shm->mainpid because we
> > see that we are already exiting.
> >
> > 	if (ret == -1) {
> > 		/* Are we already exiting ? */
> > 		if (shm->exit_reason != STILL_RUNNING)
> > 			return FALSE;
> >
> > 		/* No. Check what happened. */
> > 		if (errno == ESRCH) {
> >
> >
> > 161 if (shm->exit_reason != STILL_RUNNING)
> > (gdb) print shm->exit_reason
> > $6 = EXIT_FORK_FAILURE
> >
> > It looks like the only other place shm->mainpid is written is in
> > trinity.c:main(), which is dead. So we are stuck forever as far as I can tell.
>
> Argh. I hit this exactly once a few weeks back, and thought I had fixed it.
>
> > The last thing in trinity.log is:
> >
> > [main] couldn't create child! (Cannot allocate memory)
> >
> > From main.c:69:
> >
> > 	output(0, "couldn't create child! (%s)\n", strerror(errno));
> > 	shm->exit_reason = EXIT_FORK_FAILURE;
> > 	exit(EXIT_FAILURE);
> >
> >
> > So we exited directly and didn't let the code in main() clear shm->mainpid.
> >
> > Not sure what the correct fix is.
>
> I think just clearing mainpid before we call exit is the right thing to
> do here. I'll audit all the other exit() calls too, as this might be a
> problem in other paths.

Thanks. That fix is working for me.

It still exits after a minute or so, because it fails to fork a child in
fork_children(). I have 64 cpus and 16GB of RAM, so that's only 250MB per
child. If I reduce to 32 children then it runs much longer.

I wonder though, should failing to fork a child be a fatal error? Or could it
just skip that child and continue?

cheers
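
For reference, the fix discussed above (clear mainpid before calling exit())
could look roughly like the sketch below. This is not the actual trinity
patch. The struct layout and the helper names mark_main_exiting() and
main_is_gone() are made up for illustration, and it assumes the watchdog
treats mainpid == 0 as "main is gone".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-ins for trinity's shared state; the real
 * definitions have many more fields and exit reasons. */
enum exit_reasons { STILL_RUNNING = 0, EXIT_FORK_FAILURE };

struct shm_s {
	pid_t mainpid;
	enum exit_reasons exit_reason;
};

/* Hypothetical helper: record why main is exiting and clear mainpid in
 * one place, so no exit path can set exit_reason but leave a stale pid
 * behind. Intended to be called right before exit(). */
void mark_main_exiting(struct shm_s *s, enum exit_reasons reason)
{
	assert(s != NULL);
	s->exit_reason = reason;
	s->mainpid = 0;
}

/* Hypothetical watchdog-side check. Assumption: the watchdog stops
 * waiting on main once mainpid reads as 0. */
int main_is_gone(const struct shm_s *s)
{
	return s->mainpid == 0;
}
```

With something like this, the fork-failure path would call
mark_main_exiting(shm, EXIT_FORK_FAILURE) and then exit(EXIT_FAILURE), and the
other exit() calls could be audited for the same pattern.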