Disk Hangs with 2.4.18 and ext3

sct at redhat.com (Stephen C. Tweedie) · Wed, 31 Jul 2002 11:24:48 +0100

Hi,

On Tue, Jul 30, 2002 at 05:34:16PM -0600, Kevin Fenzi wrote:

> 3 times now, the server has gotten into an unusable state. 
> (twice was with the 2.4.18-4smp redhat kernel, once with the
> 2.4.18-5smp kernel). 
> 
> When it enters this state any process that accesses the large 
> mail data partition (also exported over NFSv2) will stop in disk
> wait and never complete. 

When debugging a hang like that, you have the problem that a hang in
one place can cause other parts of the kernel to lock up too.  If the
disk locks up, then anything waiting for a disk IO to complete will
wait forever, for example.

So the first thing to do is to look at the blocked processes and find
the lowest-level hang among them, and in this case...

> There are also a few processes stuck in DAC960_processRequest...
> 
> updatedb      D 00000000     0 12096  12093                     (NOTLB)
> Call Trace: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
> [<c0118ecb>] sleep_on [kernel] 0x4b 
> [<f885d225>] start_this_handle [jbd] 0xc5 
> [<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd 
> [<f887297e>] ext3_dirty_inode [ext3] 0x6e 

...there are signs of the raid driver being blocked.

However, it would be useful to see the whole call trace, because in
this particular trace the DAC960_ProcessRequest entry looks as if it
is simply the result of a recent interrupt, and we're not actually in
that call path at the moment (the DAC960 address has simply been left
behind on the stack, so it is picked up by the trace output).
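
When comparing the traces of several blocked tasks it can help to
split each entry into its fields first.  The following is only a
rough sketch, assuming the entry layout seen in the quoted output
("[<address>] symbol [module] offset"); the names Frame, parse_trace
and modules_in_order are made up for illustration and are not part of
any kernel tool.  It cannot tell a stale stack entry from a live one;
it only makes the module ordering easier to eyeball.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    """One call trace entry (hypothetical helper type)."""
    address: int
    symbol: str
    module: str
    offset: int


# Assumed entry layout: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
_ENTRY = re.compile(
    r"\[<([0-9a-fA-F]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+(0x[0-9a-fA-F]+)"
)

# The entries quoted in this thread, including the mail quoting prefix.
SAMPLE = """\
> Call Trace: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
> [<c0118ecb>] sleep_on [kernel] 0x4b
> [<f885d225>] start_this_handle [jbd] 0xc5
> [<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd
> [<f887297e>] ext3_dirty_inode [ext3] 0x6e
"""


def parse_trace(text: str) -> List[Frame]:
    """Return every trace entry found in text, in order of appearance."""
    return [
        Frame(int(addr, 16), symbol, module, int(offset, 16))
        for addr, symbol, module, offset in _ENTRY.findall(text)
    ]


def modules_in_order(frames: List[Frame]) -> List[str]:
    """List each module once, in the order it first appears in the trace."""
    seen: List[str] = []
    for frame in frames:
        if frame.module not in seen:
            seen.append(frame.module)
    return seen


if __name__ == "__main__":
    for frame in parse_trace(SAMPLE):
        print(f"{frame.module:8} {frame.symbol} +{frame.offset:#x}")
```

An entry at the top of the list whose module does not fit the call
path below it (as with the DAC960 entry here) is a candidate for being
stack residue rather than a real caller, but that judgement still has
to be made by hand against the full trace.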

> kjournald is also stuck:
> 
> kjournald     D F757A000  3968   119      1           199   118 (L-TLB)
> Call Trace: [<c0118ecb>] sleep_on [kernel] 0x4b 

That is just the commit thread waiting for other ext3 transactions to
complete so that it can start the commit: i.e. kjournald is a victim,
not a cause, here.
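
For the first step mentioned earlier (finding the blocked processes),
something along these lines can list the tasks currently in disk wait.
This is a minimal sketch, assuming a Linux-style /proc where
/proc/<pid>/stat begins with "pid (comm) state"; parse_stat_line and
list_blocked are illustrative names, not existing utilities.  It only
tells you who is stuck, not where; the call traces are still needed
for that.

```python
import os
from typing import List, Tuple


def parse_stat_line(line: str) -> Tuple[int, str, str]:
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so we cut at the first '(' and the last ')'.
    """
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def list_blocked(proc_root: str = "/proc") -> List[Tuple[int, str, str]]:
    """Return (pid, comm, state) for every task in state 'D' (disk wait)."""
    blocked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return blocked  # no /proc here (assumption: not a Linux system)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            continue  # the task exited while we were looking, or odd data
        if state == "D":
            blocked.append((pid, comm, state))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm, state in list_blocked():
        print(f"{pid:6d} {state} {comm}")
```

Tasks that stay in this list across repeated runs are the ones worth
tracing; a task that shows up once and then disappears was probably
just doing ordinary IO.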

Cheers,
 Stephen