Hi,

On Tue, Jul 30, 2002 at 05:34:16PM -0600, Kevin Fenzi wrote:

> 3 times now, the server has gotten into an unusable state.
> (twice was with the 2.4.18-4smp redhat kernel, once with the
> 2.4.18-5smp kernel).
>
> When it enters this state any process that accesses the large
> mail data partition (also exported over NFSv2) will stop in disk
> wait and never complete.

When debugging a hang like that, you have the problem that a hang in
one place can cause other parts of the kernel to lock up too. If the
disk locks up, then anything waiting for a disk IO to complete will
wait forever, for example.

So the first thing to do is to look for the blocked processes and find
what is the lowest-level hang going on, and in this case...

> There are also a few processes stuck in DAC960_processRequest...
>
> updatedb D 00000000 0 12096 12093 (NOTLB)
> Call Trace: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
> [<c0118ecb>] sleep_on [kernel] 0x4b
> [<f885d225>] start_this_handle [jbd] 0xc5
> [<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd
> [<f887297e>] ext3_dirty_inode [ext3] 0x6e

...there are signs of the raid driver being blocked. However, it
would be useful to see the whole call trace, because this particular
entry looks as if the DAC960_ProcessRequest entry is simply the result
of a recent interrupt, and we're not actually in that call path at the
moment (the DAC960 address has simply been left behind on the stack so
is picked up by the trace output.)

> kjournald is also stuck:
>
> kjournald D F757A000 3968 119 1 199 118 (L-TLB)
> Call Trace: [<c0118ecb>] sleep_on [kernel] 0x4b

That is just the commit thread waiting for other ext3 transactions to
complete so that it can start the commit: ie. kjournald is a victim,
not a cause, here.

Cheers,
 Stephen
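
As a rough illustration of the first triage step described above (list the blocked processes, then see which layers they are waiting in), here is a minimal Python sketch. It assumes the task-dump layout seen in the quoted traces: a header line of name, state, a hex column, a free-stack figure and then the PID, followed by `[<address>] symbol [module] offset` frames. That column order is an assumption, and the names `parse_tasks`, `blocked` and `module_counts` are invented for this example. It cannot tell a live frame from a stale address left behind on the stack, so it only narrows down where to look.

```python
import re
from collections import Counter

# Header line, e.g. "updatedb D 00000000 0 12096 12093 (NOTLB)".
# Assumed columns: name, state, hex value, free stack, PID, ...
HEADER = re.compile(r"^(\S+)\s+([A-Z])\s+[0-9A-Fa-f]{8}\s+\d+\s+(\d+)\s")
# Trace frame, e.g. "[<c0118ecb>] sleep_on [kernel] 0x4b".
FRAME = re.compile(
    r"\[<([0-9a-fA-F]+)>\]\s+(\S+)\s+\[([^\]]+)\]\s+0x[0-9a-fA-F]+"
)

# The two entries quoted in this thread, used as sample input.
SAMPLE = """\
updatedb D 00000000 0 12096 12093 (NOTLB)
Call Trace: [<f884e447>] DAC960_ProcessRequest [DAC960] 0xc7
[<c0118ecb>] sleep_on [kernel] 0x4b
[<f885d225>] start_this_handle [jbd] 0xc5
[<f885d37d>] journal_start_Rsmp_89deb980 [jbd] 0xbd
[<f887297e>] ext3_dirty_inode [ext3] 0x6e
kjournald D F757A000 3968 119 1 199 118 (L-TLB)
Call Trace: [<c0118ecb>] sleep_on [kernel] 0x4b
"""


def parse_tasks(dump):
    """Split a task dump into tasks, each with its list of trace frames."""
    tasks = []
    for raw in dump.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quoting
        header = HEADER.match(line)
        if header:
            tasks.append({
                "name": header.group(1),
                "state": header.group(2),
                "pid": int(header.group(3)),
                "frames": [],
            })
            continue
        if tasks:
            # Frames belong to the most recently seen header line.
            for _addr, symbol, module in FRAME.findall(line):
                tasks[-1]["frames"].append((symbol, module))
    return tasks


def blocked(tasks):
    """Tasks in uninterruptible (disk wait) sleep, shown as state D."""
    return [t for t in tasks if t["state"] == "D"]


def module_counts(tasks):
    """Count, per module, how many tasks have it anywhere in their trace."""
    counts = Counter()
    for task in tasks:
        for module in {m for _symbol, m in task["frames"]}:
            counts[module] += 1
    return counts


if __name__ == "__main__":
    stuck = blocked(parse_tasks(SAMPLE))
    for task in stuck:
        print(task["name"], task["pid"], [s for s, _m in task["frames"]])
    print(module_counts(stuck).most_common())
```

A module that shows up in only a few of the blocked traces, below the filesystem layers, is a candidate for the lowest-level hang, but as noted above the full trace is still needed to confirm the task is really in that call path.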