Anybody seeing this OOPS

balbir.singh@wipro.com (BALBIR SINGH) · Wed, 5 Jun 2002 17:00:16 +0530

Hello, Stephen,

Thanks for responding so quickly.

|Hi,
|
|On Wed, Jun 05, 2002 at 03:07:16PM +0530, BALBIR SINGH wrote:
| 
|> I am running Linux 2.4.7-10, 2.4.18-4 and 2.4.19-pre9, I
|> see the following oops quite often (mainly on 2.4.19-pre9 with
|> kdb). All the kernels I use have the kdb patch installed.
|
|Well, 2.4.7-10 and 2.4.18-4 both look like Red Hat kernel 
|releases, and those do _not_ have kdb installed.
|

Yes! I downloaded 2.4.7-10 and got a suitable kdb patch for it.
I also have kdb for 2.4.19-pre9, but I am still looking for a kdb
(or any other debugger) patch that applies cleanly to 2.4.18-4.
Red Hat used to include ikd with their debug kernels; I am not
sure whether that is still the case.

Absolutely, it is difficult to capture the oops with kdb running.
Sometimes doing a go at the kdb> prompt causes more panics, and the
real panic does not appear in dmesg or /var/log/messages. I will try
to capture all the information on the next oops and send it to you.
I do not have the information from this one; I lost it on rebooting,
sorry!

I turned on CONFIG_DEBUG_SLAB and, using the assembly listing from
gcc -S -g, worked out that the oops was in the slab checking code
(the POISON part).

|> One more thing I have seen is a problem with the following code
|> 
|>         do {
|>                 new_bh = get_unused_buffer_head(0);
|>                 if (!new_bh) {
|>                         printk (KERN_NOTICE __FUNCTION__
|>                                 ": ENOMEM at get_unused_buffer_head, "
|>                                 "trying again.\n");
|>                         current->policy |= SCHED_YIELD;
|>                         schedule();
|>                 }
|> 
|> when get_unused_buffer_head fails, the call to printk would
|> eventually want to flush the contents to /var/log/messages and if
|> /var/log/messages happens to be on a journalled file system, well it
|> kind of gets recursive.
|
|No, the printk buffer will wrap (discarding data if necessary) 
|if klogd can't dump the information from it fast enough, so 
|there's no deadlock.
|

Pardon me, but this might be just my impression. Ext3 has been very
unstable on my system lately. (I ran hardware tests to rule out a
hardware problem: I checked my SCSI disk using the BIOS and the
memory using memtest.) The code mentioned above causes my system
to hang, and every file system operation hangs with it. On entering
kdb I find the following:

1. The system has called bdflush - shrink_cache, et al. - kmem_cache_reap
2. The system is doing a printk (the code mentioned above)

The system seems to spin around in the code path mentioned above.
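
For reference, the shape of that loop can be reproduced in user space.
This is only a sketch under assumptions: try_alloc() and failures_left
are hypothetical stand-ins for get_unused_buffer_head() and for memory
pressure, fprintf stands in for printk, and the yield/schedule step is
left out. It shows that the loop emits one message per failed attempt
and exits only once an allocation succeeds, so it keeps spinning for as
long as nothing frees memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical knob: how many allocation attempts fail before one succeeds. */
int failures_left;

/* Hypothetical stand-in for get_unused_buffer_head(): may return NULL. */
void *try_alloc(void)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(64);
}

/*
 * Retry until the allocation succeeds, logging on every failure.
 * Stores the allocation in *out and returns the number of messages
 * emitted, which equals the number of failed attempts.
 */
unsigned long alloc_with_retry(void **out)
{
    unsigned long messages = 0;
    void *p;

    do {
        p = try_alloc();
        if (!p) {
            fprintf(stderr, "%s: ENOMEM, trying again.\n", __func__);
            messages++;
            /* The kernel loop yields the CPU here; omitted in this sketch. */
        }
    } while (!p);

    *out = p;
    return messages;
}
```

If try_alloc() never succeeds, alloc_with_retry() never returns and the
message count grows without bound.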

As for the printk buffer wrapping, this is roughly what I think
happens:

1. Multiple printks are queued for klogd to write out.
2. klogd tries to write out the data, which calls file system
   specific code.
3. Ext3 runs out of buffer heads and calls printk. Even if the buffer
   has wrapped around, there is now another printk in the queue, so
   we go back to step 2.
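
To make the wrapping behaviour Stephen describes concrete, here is a
minimal user-space sketch of a log buffer whose writer never blocks and
simply overwrites the oldest data when the reader falls behind. It is
not the kernel's printk code; log_ring, log_write, log_read and
LOG_BUF_LEN are hypothetical names chosen for illustration.

```c
#include <stddef.h>

#define LOG_BUF_LEN 16  /* hypothetical size; must be a power of two */

/* Hypothetical stand-in for the kernel log buffer. */
struct log_ring {
    char buf[LOG_BUF_LEN];
    unsigned long head;  /* total bytes ever written */
    unsigned long tail;  /* total bytes consumed or discarded */
};

/* Writer side: never waits for the reader, drops the oldest bytes instead. */
void log_write(struct log_ring *r, const char *s)
{
    for (; *s; s++) {
        r->buf[r->head & (LOG_BUF_LEN - 1)] = *s;
        r->head++;
        if (r->head - r->tail > LOG_BUF_LEN)
            r->tail = r->head - LOG_BUF_LEN;  /* discard oldest data */
    }
}

/* Reader side (the klogd role): copy out up to n unread bytes. */
size_t log_read(struct log_ring *r, char *out, size_t n)
{
    size_t i = 0;

    while (i < n && r->tail != r->head) {
        out[i++] = r->buf[r->tail & (LOG_BUF_LEN - 1)];
        r->tail++;
    }
    return i;
}
```

With a 16-byte buffer, writing 20 bytes before any read leaves only the
last 16 available to the reader. The writer itself never waits, which
is the property Stephen points to; whether the reader can make progress
is a separate question.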

Like you said, it might be the VM that is broken.

Have you tried running some tests on ext3 with DEBUG_EXT3 and
CONFIG_DEBUG_SLAB turned on? I will try IOZONE with these options
and see if I can find something.

|Cheers,
| Stephen
|
