Disk Hangs with 2.4.18 and ext3

kevin at tummy.com (Kevin Fenzi) · Wed, 31 Jul 2002 13:47:17 -0600

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks for the quick reply! 

>>>>> "Stephen" == Stephen C Tweedie <sct@redhat.com> writes:

Stephen> Hi, On Tue, Jul 30, 2002 at 05:34:16PM -0600, Kevin Fenzi
Stephen> wrote:

>> Three times now, the server has gotten into an unusable state
>> (twice with the 2.4.18-4smp Red Hat kernel, once with the
>> 2.4.18-5smp kernel).
>> 
>> When it enters this state any process that accesses the large mail
>> data partition (also exported over NFSv2) will stop in disk wait
>> and never complete.

Stephen> When debugging a hang like that, you have the problem that a
Stephen> hang in one place can cause other parts of the kernel to lock
Stephen> up too.  If the disk locks up, then anything waiting for a
Stephen> disk IO to complete will wait forever, for example.

yeah... :( 

Stephen> So the first thing to do is to look for the blocked processes
Stephen> and find what is the lowest-level hang going on, and in this
Stephen> case...

>> There are also a few processes stuck in DAC960_ProcessRequest...
>> 
>> updatedb D 00000000 0 12096 12093 (NOTLB)
>> Call Trace:
>> [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
>> [<c0118ecb>] sleep_on [kernel] 0x4b
>> [<f885d225>] start_this_handle [jbd] 0xc5
>> [<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd
>> [<f887297e>] ext3_dirty_inode [ext3] 0x6e

Stephen> ...there are signs of the raid driver being blocked.

Is this indicative of a hardware problem with the RAID controller, or
of something else? We did do quite a bit of heavy disk activity on this
machine during burn-in.

Stephen> However, it would be useful to see the whole call trace,
Stephen> because this particular entry looks as if the
Stephen> DAC960_ProcessRequest entry is simply the result of a recent
Stephen> interrupt, and we're not actually in that call path at the
Stephen> moment (the DAC960 address has simply been left behind on the
Stephen> stack so is picked up by the trace output.)

OK. Where would I be able to get that?
I do have the complete sysrq output of the process traces...
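
One way to triage a full sysrq task dump along the lines Stephen
describes is to list every task in disk wait (state D) and count which
modules show up in their traces. What follows is a minimal Python
sketch, not a tool from this thread: the names parse_tasks, blocked and
module_counts are hypothetical, and the regular expressions assume the
"[<addr>] symbol [module] 0xoffset" frame layout seen in the excerpts
quoted in this message. As Stephen points out, a frame may be a stale
address left on the stack, so the counts are only a hint about where to
look first.

```python
import re
from collections import Counter

# Task header as it appears in the quoted excerpts (assumed layout):
#   name  state  pc  free-stack  pid  parent-pid ...
TASK_RE = re.compile(
    r"(?<!\S)(?P<name>\S+)\s+(?P<state>[RSDZTW])\s+(?P<pc>[0-9A-Fa-f]{8})\s+"
    r"(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)"
)

# One call-trace entry: [<address>] symbol [module] 0xoffset
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]{8})>\]\s+(?P<sym>\S+)\s+"
    r"\[(?P<mod>[^\]\s]+)\]\s+(?P<off>0x[0-9a-f]+)"
)


def parse_tasks(text):
    """Split a task dump into tasks, each with its (symbol, module) frames."""
    headers = list(TASK_RE.finditer(text))
    tasks = []
    for i, header in enumerate(headers):
        # A task's frames run from its header up to the next task header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        frames = [(m["sym"], m["mod"]) for m in FRAME_RE.finditer(body)]
        tasks.append({
            "name": header["name"],
            "state": header["state"],
            "pid": int(header["pid"]),
            "frames": frames,
        })
    return tasks


def blocked(tasks):
    """Return only the tasks in uninterruptible disk wait (state D)."""
    return [t for t in tasks if t["state"] == "D"]


def module_counts(tasks):
    """Count, per module, how many blocked tasks have a frame in it."""
    counts = Counter()
    for task in blocked(tasks):
        for module in {mod for _, mod in task["frames"]}:
            counts[module] += 1
    return counts


# The two excerpts quoted in this message, with the ">>" markers removed.
SAMPLE = """
updatedb D 00000000 0 12096 12093 (NOTLB) Call Trace: [<f884e447>]
DAC960_ProcessRequest [DAC960] 0xc7 [<c0118ecb>] sleep_on [kernel]
0x4b [<f885d225>] start_this_handle [jbd] 0xc5 [<f885d37d>]
journal_start_Rsmp_89deb980 [jbd] 0xbd [<f887297e>]
ext3_dirty_inode [ext3] 0x6e
kjournald D F757A000 3968 119 1 199 118 (L-TLB) Call Trace:
[<c0118ecb>] sleep_on [kernel] 0x4b
"""

if __name__ == "__main__":
    for task in blocked(parse_tasks(SAMPLE)):
        symbols = " <- ".join(sym for sym, _ in task["frames"])
        print(task["name"], task["pid"], symbols)
    print(dict(module_counts(parse_tasks(SAMPLE))))
```

Run over the whole dump rather than the two excerpts, this would show
whether many blocked tasks share the same lowest-level frames or whether
most of them are simply waiting on the journal.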

>> kjournald is also stuck:
>> 
>> kjournald D F757A000 3968 119 1 199 118 (L-TLB)
>> Call Trace:
>> [<c0118ecb>] sleep_on [kernel] 0x4b

Stephen> That is just the commit thread waiting for other ext3
Stephen> transactions to complete so that it can start the commit:
Stephen> ie. kjournald is a victim, not a cause, here.

Yeah, as I suspected...

The machine has been running with ext2 for a day and a half with no
lockups so far, but that might just mean the triggering conditions
haven't occurred yet.

Is there any further info I could gather that would help track down
the culprit?

Stephen> Cheers, Stephen

kevin
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.6 and Gnu Privacy Guard <http://www.gnupg.org/>

iD8DBQE9SD7J3imCezTjY0ERAktZAKCICtv/lo/W+uBuDDbCqf2kIewPzwCfalr9
Pe+RJZqJ8KgYjI5KtKxRARA=
=WNyY
-----END PGP SIGNATURE-----