Re: crash endlessly looping on stdout error

Guy Streeter <streeter@xxxxxxxxxx> · Wed, 22 Feb 2012 11:01:02 -0600

On 02/21/2012 10:44 AM, Dave Anderson wrote:
> 
> 
> ----- Original Message -----
>> We have a recurring problem in our crash analysis system, where remote users
>> get disconnected and crash starts endlessly looping trying to write to stdout.
>> An strace of a recent instance is looping on:
>>
>> write(1, "  JIFFIES\n", 10)             = -1 EIO (Input/output error)
>>
>> but that isn't always the output string.
>>
>> This is a problem in our shared environment because the orphaned crash tasks
>> eat up the CPUs, and we don't have the privilege to kill each other's tasks.
>>
>> thanks,
>> --Guy
> 
> Hmmm, upon initial glance, this seemed to be related to the crash-5.0.2
> fix that you guys reported:
> 
>     - Fix to prevent a crash session that is run over a network connection
>       that is killed/removed from going into 100% cpu-time loop.  Without
>       the patch, the behavior of the built-in readline() library call in
>       gdb-7.0 has changed such that the function returns when the EOF is
>       encountered on /dev/tty, and the crash session goes into an endless
>       loop; whereas in gdb-6.1, the readline() call never returns because
>       the crash session gets killed while running in the library code.
>       (anderson@xxxxxxxxxx)
> 
> But if the orphaned task is repetitively writing the same thing, it
> would never get to the next readline() call, where it would kill
> itself.  Taking your example, the "JIFFIES" write() is part of a "timer"
> command, but I'm trying to understand how/why the command is not just 
> completing a series of (failed) fprintf's, and then falling into
> the next readline() -- where it should kill itself?  By any chance
> was the remote caller doing a "repeat" command on the live system,
> or something like that?  (sounds doubtful since you'd have to have
> root privileges to do that...)
> 

This is not a live system. This is the setup where we analyze vmcores sent in
by our customers.
I don't understand how it happens either, unless for some reason fprintf is
retrying the failed write().
This is not the only failure scenario. I just saw another one repeating on
this sequence:

rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---
rt_sigaction(SIGFPE, {0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0},
{0x550ff0, [FPE], SA_RESTORER|SA_RESTART, 0x370ac302d0}, 8) = 0
rt_sigreturn(0x8)                       = -1 ENETDOWN (Network is down)
--- SIGFPE (Floating point exception) @ 0 (0) ---

Perhaps it isn't a crash program issue at all. Maybe it's at the system
library level.

--Guy
