On Thu, 2014-05-22 at 12:40 +1000, Michael Ellerman wrote:
> On Wed, 2014-05-14 at 09:35 -0400, Dave Jones wrote:
> > On Wed, May 14, 2014 at 05:26:29PM +1000, Michael Ellerman wrote:
> >
> > > > > Not sure what the correct fix is.
> > > >
> > > > I think just clearing mainpid before we call exit is the right thing to
> > > > do here. I'll audit all the other exit() calls too, as this might be a
> > > > problem in other paths.
> > >
> > > Thanks. That fix is working for me.
> > >
> > > It still exits after a minute or so, because it fails to fork a child in
> > > fork_children().
> > >
> > > I have 64 cpus and 16GB of RAM, so that's only 250MB per child.
> > >
> > > If I reduce to 32 children then it runs much longer.
> > >
> > > I wonder though, should failing to fork a child be a fatal error? Or could it
> > > just skip that child and continue.
> >
> > Maybe. It could wait until another child exits before retrying.
> > Something like the patch below maybe. I think I tried something like
> > this before though, and it resulted in a flood of failed forks.
> >
> > Let me know how this work out.
>
> Sorry I didn't get back to you on this. I've been chasing a bug that trinity
> found for us.
>
> Running aae6d6a I've seen this once, but only once:

And this one, which looks more fun :)

$ trinity -q
Trinity v1.5pre  Dave Jones <davej@xxxxxxxxxx>
Done parsing arguments.
Marking all syscalls as enabled.
[init] Enabled 323 syscalls. Disabled 0 syscalls.
[init] Using pid_max = 65536
[init] Started watchdog process, PID is 47158
[main] Main thread is alive.
[main] Registered 6 fd providers.
[main] Couldn't find socket cachefile. Regenerating.
[main] created 375 sockets
[main] Generating file descriptors
[main] Added 276 filenames from /dev
[main] Something went wrong during nftw(/proc). (-1:Value too large for defined data type)
[main] Added 10283 filenames from /sys
[child30:56679] nfsservctl (168) returned ENOSYS, marking as inactive.
[child30:56679] stat (18) returned ENOSYS, marking as inactive.
[child1:56650] acct (51) returned ENOSYS, marking as inactive.
[child10:56659] quotactl (131) returned ENOSYS, marking as inactive.
[child28:56677] lstat (84) returned ENOSYS, marking as inactive.
[child15:56664] sysctl (149) returned ENOSYS, marking as inactive.
[watchdog] Watchdog is alive. (pid:47158)
[child6:56655] ipc (117) returned ENOSYS, marking as inactive.
[child11:56660] BUG!: CHILD (pid:56660) GOT REPARENTED! parent pid:47159. Watchdog pid:47158
[child11:56660] BUG!: Last syscalls:
[child11:56660] [0] pid:56649 call:io_getevents callno:23
[child11:56660] [1] pid:56650 call:syslog callno:23
[child11:56660] [2] pid:56651 call:getxattr callno:78
[child11:56660] [3] pid:56652 call:set_mempolicy callno:3
[child11:56660] [4] pid:56653 call:getdents64 callno:12
[child11:56660] [5] pid:56654 call:setgroups callno:8
[child11:56660] [6] pid:56655 call:rt_sigpending callno:31
[child11:56660] [7] pid:56656 call:mmap callno:15
[child11:56660] [8] pid:56657 call:setxattr callno:16
[child11:56660] [9] pid:56658 call:delete_module callno:6
[child11:56660] [10] pid:56659 call:timer_delete callno:122
[child11:56660] [11] pid:56660 call:clock_getres callno:279
[child11:56660] [12] pid:56661 call:open callno:20
[child11:56660] [13] pid:56662 call:setregid callno:176
[child11:56660] [14] pid:56663 call:mount callno:24
[child11:56660] [15] pid:56664 call:mkdir callno:106
[child11:56660] [16] pid:56665 call:unshare callno:72
[child11:56660] [17] pid:56666 call:sched_get_priority_max callno:47
[child11:56660] [18] pid:56667 call:sched_getparam callno:158
[child11:56660] [19] pid:56668 call:linkat callno:38
[child11:56660] [20] pid:56669 call:utime callno:13
[child11:56660] [21] pid:56670 call:epoll_ctl callno:12
[child11:56660] [22] pid:56671 call:fremovexattr callno:33
[child11:56660] [23] pid:56672 call:mincore callno:117
[child11:56660] [24] pid:56673 call:init_module callno:136
[child11:56660] [25] pid:56674 call:inotify_init1 callno:20
[child11:56660] [26] pid:56675 call:ssetmask callno:45
[child11:56660] [27] pid:56676 call:mmap callno:46
[child11:56660] [28] pid:56677 call:access callno:115
[child11:56660] [29] pid:56678 call:ioprio_set callno:63
[child11:56660] [30] pid:56679 call:old_readdir callno:132
[child11:56660] [31] pid:56680 call:gettimeofday callno:89
I/O possible
$

cheers