On 02/21/2012 10:44 AM, Dave Anderson wrote:
>
>
> ----- Original Message -----
>> We have a recurring problem in our crash analysis system, where remote users
>> get disconnected and crash starts endlessly looping trying to write to stdout.
>> An strace of a recent instance is looping on:
>>
>> write(1, " JIFFIES\n", 10) = -1 EIO (Input/output error)
>>
>> but that isn't always the output string.
>>
>> this is a problem in our shared environment because the orphaned crash tasks
>> eat up the CPUs, and we don't have the privilege to kill each other's tasks.
>>
>> thanks,
>> --Guy
>
> Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
> fix that you guys reported:
>
>  - Fix to prevent a crash session that is run over a network connection
>    that is killed/removed from going into 100% cpu-time loop. Without
>    the patch, the behavior of the built-in readline() library call in
>    gdb-7.0 has changed such that the function returns when the EOF is
>    encountered on /dev/tty, and the crash session goes into an endless
>    loop; whereas in gdb-6.1, the readline() call never returns because
>    the crash session gets killed while running in the library code.
>    (anderson@xxxxxxxxxx)
>
> But if the orphaned task is repetitively writing the same thing, it
> would never get to the next readline() call, where it would kill
> itself. Taking your example, the "JIFFIES" write() is part of a "timer"
> command, but I'm trying to understand how/why the command is not just
> completing a series of (failed) fprintf's, and then falling into
> the next readline() -- where it should kill itself? By any chance
> was the remote caller doing a "repeat" command on the live system,
> or something like that? (sounds doubtful since you'd have to have
> root privileges to do that...)
>

This is not a live system. This is the setup where we analyze vmcores
sent in by our customers. I don't understand how it happens either,
unless for some reason fprintf is re-trying the failed write().
This is not the only failure scenario. I just saw another one repeating
on this sequence:

rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8) = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---

Perhaps it isn't a crash program issue at all. Maybe it's at the system
library level.

--Guy

--
Crash-utility mailing list
Crash-utility@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/crash-utility