-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thanks for the quick reply!

>>>>> "Stephen" == Stephen C Tweedie <sct@redhat.com> writes:

Stephen> Hi,
Stephen> On Tue, Jul 30, 2002 at 05:34:16PM -0600, Kevin Fenzi wrote:

>> 3 times now, the server has gotten into an unusable state. (twice
>> was with the 2.4.18-4smp redhat kernel, once with the 2.4.18-5smp
>> kernel).
>>
>> When it enters this state any process that accesses the large mail
>> data partition (also exported over NFSv2) will stop in disk wait
>> and never complete.

Stephen> When debugging a hang like that, you have the problem that a
Stephen> hang in one place can cause other parts of the kernel to lock
Stephen> up too. If the disk locks up, then anything waiting for a
Stephen> disk IO to complete will wait forever, for example.

yeah... :(

Stephen> So the first thing to do is to look for the blocked processes
Stephen> and find what is the lowest-level hang going on, and in this
Stephen> case...

>> There are also a few processes stuck in DAC960_ProcessRequest...
>>
>> updatedb D 00000000 0 12096 12093 (NOTLB)
>> Call Trace: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
>> [<c0118ecb>] sleep_on [kernel] 0x4b
>> [<f885d225>] start_this_handle [jbd] 0xc5
>> [<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd
>> [<f887297e>] ext3_dirty_inode [ext3] 0x6e

Stephen> ...there are signs of the raid driver being blocked.

Is this indicative of a hardware problem with the RAID controller?
Or something else? We did do quite a bit of heavy disk activity on
this machine during burn-in.

Stephen> However, it would be useful to see the whole call trace,
Stephen> because this particular entry looks as if the
Stephen> DAC960_ProcessRequest entry is simply the result of a recent
Stephen> interrupt, and we're not actually in that call path at the
Stephen> moment (the DAC960 address has simply been left behind on the
Stephen> stack so is picked up by the trace output.)

OK. Where would I be able to get that?
I do have the complete sysrq output of process traces...

>> kjournald is also stuck:
>>
>> kjournald D F757A000 3968 119 1 199 118 (L-TLB)
>> Call Trace: [<c0118ecb>] sleep_on [kernel] 0x4b

Stephen> That is just the commit thread waiting for other ext3
Stephen> transactions to complete so that it can start the commit:
Stephen> ie. kjournald is a victim, not a cause, here.

yeah, as I suspected...

The machine has been running with ext2 for a day and a half with no
lockups so far, but that might just mean the conditions haven't hit it
yet.

Is there any further info I could gather that would help tracking down
the culprit on this?

Stephen> Cheers, Stephen

kevin

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.6 and Gnu Privacy Guard <http://www.gnupg.org/>

iD8DBQE9SD7J3imCezTjY0ERAktZAKCICtv/lo/W+uBuDDbCqf2kIewPzwCfalr9
Pe+RJZqJ8KgYjI5KtKxRARA=
=WNyY
-----END PGP SIGNATURE-----
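
Stephen's suggestion above (find the blocked processes first, then look for the lowest-level hang) could be roughed out as a small script over the saved sysrq task dump. This is only a sketch: it assumes the dump looks like the two entries quoted in this thread (task name, state letter, an 8-digit hex field, free stack, PID, then `[<addr>] symbol [module] offset` frames), and the names `parse_tasks`, `blocked`, and `module_counts` are made up for illustration. Other kernels print the dump differently, so the patterns would need adjusting.

```python
import re
from collections import Counter

# Assumed task header shape, e.g. "updatedb D 00000000 0 12096 12093 (NOTLB)":
# name, state letter, 8 hex digits, free stack, PID.
TASK_RE = re.compile(
    r"(?<!\S)(?P<name>\S+)\s+(?P<state>[RSDTZ])\s+[0-9A-Fa-f]{8}\s+\d+\s+(?P<pid>\d+)\b"
)
# Assumed frame shape, e.g. "[<c0118ecb>] sleep_on [kernel] 0x4b".
FRAME_RE = re.compile(
    r"\[<[0-9A-Fa-f]{8}>\]\s+(?P<sym>\S+)\s+\[(?P<module>[^\]]+)\]\s+0x[0-9A-Fa-f]+"
)

# The two traces quoted in this thread, used as sample input.
SAMPLE = """\
updatedb D 00000000 0 12096 12093 (NOTLB)
Call Trace: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
[<c0118ecb>] sleep_on [kernel] 0x4b
[<f885d225>] start_this_handle [jbd] 0xc5
[<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd
[<f887297e>] ext3_dirty_inode [ext3] 0x6e
kjournald D F757A000 3968 119 1 199 118 (L-TLB)
Call Trace: [<c0118ecb>] sleep_on [kernel] 0x4b
"""


def parse_tasks(text):
    """Split a task dump into tasks, each with its (symbol, module) frames."""
    headers = list(TASK_RE.finditer(text))
    tasks = []
    for i, h in enumerate(headers):
        # A task's frames run from its header to the next task header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[h.end():end]
        frames = [(m["sym"], m["module"]) for m in FRAME_RE.finditer(body)]
        tasks.append({
            "name": h["name"],
            "pid": int(h["pid"]),
            "state": h["state"],
            "frames": frames,
        })
    return tasks


def blocked(tasks):
    """Keep only tasks in uninterruptible disk wait (state D)."""
    return [t for t in tasks if t["state"] == "D"]


def module_counts(tasks):
    """Count, per module, how many blocked tasks mention it in their trace."""
    counts = Counter()
    for t in blocked(tasks):
        for module in {mod for _, mod in t["frames"]}:
            counts[module] += 1
    return counts


if __name__ == "__main__":
    for t in blocked(parse_tasks(SAMPLE)):
        chain = " <- ".join(sym for sym, _ in t["frames"])
        print(f'{t["name"]} (pid {t["pid"]}): {chain}')
```

This is a first pass only. As Stephen notes, an address near the top of a trace can be stale stack residue from a recent interrupt, so a module showing up in many blocked tasks is a hint about where to look and not proof that the task is currently in that call path.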