On Wed, 2014-05-14 at 09:35 -0400, Dave Jones wrote:
> On Wed, May 14, 2014 at 05:26:29PM +1000, Michael Ellerman wrote:
> 
>  > >  > Not sure what the correct fix is.
>  > > 
>  > > I think just clearing mainpid before we call exit is the right thing to
>  > > do here.  I'll audit all the other exit() calls too, as this might be a
>  > > problem in other paths.
>  > 
>  > Thanks. That fix is working for me.
>  > 
>  > It still exits after a minute or so, because it fails to fork a child in
>  > fork_children().
>  > 
>  > I have 64 cpus and 16GB of RAM, so that's only 250MB per child.
>  > 
>  > If I reduce to 32 children then it runs much longer.
>  > 
>  > I wonder though, should failing to fork a child be a fatal error? Or could it
>  > just skip that child and continue.
> 
> Maybe.  It could wait until another child exits before retrying.
> Something like the patch below maybe.  I think I tried something like
> this before though, and it resulted in a flood of failed forks.
> 
> Let me know how this works out.
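
For readers without the patch in front of them, the idea of waiting for a
child to exit before retrying a failed fork might look roughly like the
sketch below. This is not Dave's patch and not trinity's fork_children();
fork_with_retry() and max_retries are hypothetical names used only for
illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not trinity code: try to fork, and if fork() fails
 * for lack of resources, block until one existing child exits, then retry.
 * Returns the child's pid in the parent, 0 in the child, or -1 on giving up.
 */
static pid_t fork_with_retry(int max_retries)
{
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        pid_t pid = fork();
        if (pid >= 0)
            return pid;     /* success: parent gets the pid, child gets 0 */

        if (errno != EAGAIN && errno != ENOMEM)
            return -1;      /* not a resource shortage, so don't retry */

        /* Wait for any child to exit, which should free some resources. */
        if (waitpid(-1, NULL, 0) < 0)
            return -1;      /* e.g. ECHILD: no children left to wait for */
    }
    return -1;              /* retries exhausted */
}
```

Bounding the retries is one way to avoid the flood of failed forks mentioned
above. Note that the waitpid() call reaps a child, so a real caller would
also have to update whatever table tracks its children.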

Sorry I didn't get back to you on this. I've been chasing a bug that trinity
found for us.

Running aae6d6a I've seen this once, but only once:

[watchdog] Sanity check failed! Found pid 1885550132!
[watchdog] problem checking on pid 112 (1:Operation not permitted)
[watchdog] pid 1885550132 has disappeared (oom-killed maybe?). Reaping.
[watchdog] pid 678326126 has disappeared (oom-killed maybe?). Reaping.
[watchdog] pid 1697185792 has disappeared (oom-killed maybe?). Reaping.
[watchdog] Reaped 3 dead children
Killed
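
As a rough illustration of why those pids fail a sanity check: assuming a
64-bit Linux kernel, pid_max cannot be set above 4 * 1024 * 1024, and a valid
pid is always below pid_max. The sketch below is not trinity's actual check;
pid_in_range() and PID_MAX_LIMIT_64 are hypothetical names.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Upper bound for pid_max on a 64-bit Linux kernel (assumed here). */
#define PID_MAX_LIMIT_64 (4 * 1024 * 1024)

/*
 * Hypothetical range check: a pid is plausible only if it is positive
 * and strictly below pid_max (pid_max is one greater than the largest pid).
 */
static bool pid_in_range(pid_t pid, pid_t pid_max)
{
    return pid > 0 && pid < pid_max;
}
```

All three reaped values (1885550132, 678326126, 1697185792) are far outside
that range, while pid 112 is inside it.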

cheers

