On Mon, Oct 22, 2012 at 3:41 AM, Rolf Eike Beer <eike-kernel@xxxxxxxxx> wrote:
> My C3600 runs the CMake nightly builds. Basically this is a master process
> (ctest) that forks other binaries that do the actual tests. Afterwards it
> collects the output. If the child does not respond for some time (usually
> set to 30 minutes) it will get killed by ctest.
>
> Yesterday someone accidentally introduced an endless loop into CMake, so
> some of the called tests will run at 100% CPU load forever. The master
> process was not affected by this, so these children should eventually have
> been killed. But this did not happen. It did happen on all other machines
> building those tests
> (http://open.cdash.org/index.php?project=CMake&date=2012-10-21, e.g.
> http://open.cdash.org/viewTest.php?onlyfailed&buildid=2621607), but not on
> my machine. And from all I can tell it does not look as if it is a
> ctest bug, but something in the scheduler or something like that not working
> properly.
>
> $ ping voyager
> PING voyager (192.168.2.119) 56(84) bytes of data.
> 64 bytes from voyager (192.168.2.119): icmp_seq=1 ttl=64 time=0.504 ms
> 64 bytes from voyager (192.168.2.119): icmp_seq=2 ttl=64 time=0.268 ms
> 64 bytes from voyager (192.168.2.119): icmp_seq=3 ttl=64 time=0.274 ms
>
> So, the machine is alive and the ping time is ok. Doing ssh to it will get
> stuck for hours (literally). So, sadly, I currently have no way to get into
> userland of the machine. What I know is:
>
> - ssh doesn't work
> - kernel is alive
> - the machine is very likely running at 100% CPU load from a normal user
>   account (with not too excessive RAM usage AFAIK)
>
> So for me it looks like this "userspace is at 100%" does something utterly
> bad to the scheduling, as it seems that no other processes will get their
> chance of running. If ctest got its chance it should have killed the
> slave after ~30 minutes, and ssh should definitely work.
> From what I see on
> other machines the worst case scenario would be 18 of these amok processes,
> so after ~9 hours the dust should start to clear. That would have been
> nearly 20 hours ago, so something is not working there.
>
> Any ideas?

Get on the serial console and see if you can log in?
Then issue a magic-sysrq+t? What's userspace doing?

Cheers,
Carlos.
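
For reference, the behaviour the quoted report expects from the master process (start a child, wait up to a limit, kill it if it has not exited) can be sketched as below. This is only an illustration of the general pattern, not ctest's actual implementation; `run_with_timeout` and `BUSY_LOOP` are hypothetical names made up for the example, and the busy-looping child merely stands in for the broken test.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and kill it if it has not exited after timeout_s seconds.

    Returns (status, returncode), where status is "finished" or "timeout".
    Hypothetical helper for illustration; not how ctest is implemented.
    """
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
        return "finished", returncode
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX systems
        proc.wait()   # reap the child so it does not linger as a zombie
        return "timeout", proc.returncode


# A child that spins at 100% CPU forever, standing in for the looping test.
BUSY_LOOP = [sys.executable, "-c", "while True: pass"]

if __name__ == "__main__":
    # Short limit for demonstration; the report mentions a 30 minute limit.
    status, _ = run_with_timeout(BUSY_LOOP, 2)
    print(status)
```

Note that the kill only happens if the parent itself gets scheduled once the limit expires, which is exactly what appears not to happen on the affected machine. With 18 such children run one after another and a 30 minute limit each, the worst case is 18 * 30 min = 9 hours, matching the estimate in the quoted message.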