On Thu, May 22, 2014 at 12:40:36PM +1000, Michael Ellerman wrote:

 > Sorry I didn't get back to you on this. I've been chasing a bug that trinity
 > found for us.
 >
 > Running aae6d6a I've seen this once, but only once:
 >
 > [watchdog] Sanity check failed! Found pid 1885550132!
 > [watchdog] problem checking on pid 112 (1:Operation not permitted)
 > [watchdog] pid 1885550132 has disappeared (oom-killed maybe?). Reaping.
 > [watchdog] pid 678326126 has disappeared (oom-killed maybe?). Reaping.
 > [watchdog] pid 1697185792 has disappeared (oom-killed maybe?). Reaping.
 > [watchdog] Reaped 3 dead children
 > Killed

If it happens again, check /proc/sys/kernel/pid_max. I wonder if something
scribbled in there. (We only read it on startup, so if it changes under us,
and we start getting pids out of our expected range, that could go awry.)

I'll add some more robustness to that check tomorrow.

Though looking at the pids in the dump above, I wonder if there's something
more screwed up, like we corrupted the ptrs to the pid map in the shm.

	Dave
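
As a rough illustration of the pid_max point above (this is not trinity's
actual code): the sketch below caches pid_max, and when a pid falls outside
the cached range it re-reads the file once before declaring the pid bogus.
The function names, the DEFAULT_PID_MAX fallback of 32768, and the path
parameter are all assumptions made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumed fallback when pid_max cannot be read (hypothetical choice). */
#define DEFAULT_PID_MAX 32768L

/* Read pid_max from the given path, e.g. /proc/sys/kernel/pid_max.
 * Returns DEFAULT_PID_MAX if the file is missing or unparsable. */
static long read_pid_max(const char *path)
{
	FILE *fp = fopen(path, "r");
	long val = 0;

	if (fp == NULL)
		return DEFAULT_PID_MAX;

	if (fscanf(fp, "%ld", &val) != 1 || val <= 0)
		val = DEFAULT_PID_MAX;

	fclose(fp);
	return val;
}

/* pid_max is the value at which pids wrap, so valid pids are
 * 1 .. pid_max - 1. */
static bool pid_in_range(pid_t pid, long pid_max)
{
	return pid >= 1 && (long) pid < pid_max;
}

/* Check a pid against the cached pid_max. If it looks out of range,
 * refresh the cache once in case pid_max changed after startup, then
 * check again. */
static bool pid_is_sane(pid_t pid, long *cached_pid_max, const char *path)
{
	assert(cached_pid_max != NULL);

	if (pid_in_range(pid, *cached_pid_max))
		return true;

	*cached_pid_max = read_pid_max(path);
	return pid_in_range(pid, *cached_pid_max);
}
```

With a check like this, a pid such as 1885550132 from the dump would still be
rejected after the re-read, which would point away from a changed pid_max and
toward corruption elsewhere.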