Hi,

I am getting the following Oops on a 2.6.30.1 kernel. The bad part is, it happens rarely (twice in the last 1.5 months) and the system is pretty lightly loaded when this happens (no heavy file/disk io).

Any insights or patches that I can try? (I searched lkml and the ext3 lists but could not find any similar oops/reports.)

== Oops ===================
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff80373520>] __journal_remove_journal_head+0x10/0x120
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/class/scsi_host/host0/proc_name
CPU 0
Pid: 3834, comm: kjournald Not tainted 2.6.30.1_test #1
RIP: 0010:[<ffffffff80373520>]  [<ffffffff80373520>] __journal_remove_journal_head+0x10/0x120
RSP: 0018:ffff880c7ee11d80  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000034
RDX: 0000000000000002 RSI: ffff8804ee82aa20 RDI: ffff8804ee82aa20
RBP: ffff880c7ee11d90 R08: 0400000000000000 R09: 0000000000000000
R10: ffffffff803706af R11: 0000000000000000 R12: ffff8808659bc198
R13: 0000000000000001 R14: ffff880bd435a980 R15: ffff880c7959d000
FS:  0000000000000000(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kjournald (pid: 3834, threadinfo ffff880c7ee10000, task ffff880c794900c0)
Stack:
 ffff8804ee82aa20 ffff8808659bc198 ffff880c7ee11db0 ffffffff80374fd4
 ffff880c7ee11db0 ffff8804ee82aa20 ffff880c7ee11e90 ffffffff8037073d
 ffff880c7959d3a8 ffff880c7ee11e48 ffff880c7959d028 ffff880c7959d338
Call Trace:
 [<ffffffff80374fd4>] journal_remove_journal_head+0x24/0x50
 [<ffffffff8037073d>] journal_commit_transaction+0x41d/0x1150
 [<ffffffff8024f8cc>] ? try_to_del_timer_sync+0x5c/0x70
 [<ffffffff8037498f>] kjournald+0xff/0x270
 [<ffffffff8025c370>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff80374890>] ? kjournald+0x0/0x270
 [<ffffffff8025bf63>] kthread+0x63/0x90
 [<ffffffff8020cffa>] child_rip+0xa/0x20
 [<ffffffff8025bf00>] ? kthread+0x0/0x90
 [<ffffffff8020cff0>] ? child_rip+0x0/0x20
Code: 1f 44 00 00 48 89 f8 48 8b 3d 7d 0d ca 00 48 89 c6 e8 85 35 f5 ff c9 c3 0f 1f 00 55 48 89 e5 41 54 53 0f 1f 44 00 00 48 8b 5f 40 <8b> 4b 08 85 c9 0f 88 f2 00 00 00 f0 ff 47 60 8b 53 08 85 d2 75
RIP  [<ffffffff80373520>] __journal_remove_journal_head+0x10/0x120
 RSP <ffff880c7ee11d80>
CR2: 0000000000000008
---[ end trace 2a47799c65258934 ]---

Looking at the disassembly of __journal_remove_journal_head():

==============
0xffffffff8037b760 <__journal_remove_journal_head+0>:   push   %rbp
0xffffffff8037b761 <__journal_remove_journal_head+1>:   mov    %rsp,%rbp
0xffffffff8037b764 <__journal_remove_journal_head+4>:   push   %r12
0xffffffff8037b766 <__journal_remove_journal_head+6>:   push   %rbx
0xffffffff8037b767 <__journal_remove_journal_head+7>:   callq  0xffffffff8020bcc0 <mcount>
0xffffffff8037b76c <__journal_remove_journal_head+12>:  mov    0x40(%rdi),%rbx
0xffffffff8037b770 <__journal_remove_journal_head+16>:  mov    0x8(%rbx),%r8d    <====== Oops
0xffffffff8037b774 <__journal_remove_journal_head+20>:  test   %r8d,%r8d
0xffffffff8037b777 <__journal_remove_journal_head+23>:  js     0xffffffff8037b86d <__journal_remove_journal_head+269>
0xffffffff8037b77d <__journal_remove_journal_head+29>:  lock incl 0x60(%rdi)
0xffffffff8037b781 <__journal_remove_journal_head+33>:  mov    0x8(%rbx),%esi
0xffffffff8037b784 <__journal_remove_journal_head+36>:  test   %esi,%esi
0xffffffff8037b786 <__journal_remove_journal_head+38>:  jne    0xffffffff8037b78f <__journal_remove_journal_head+47>
0xffffffff8037b788 <__journal_remove_journal_head+40>:  cmpq   $0x0,0x28(%rbx)
0xffffffff8037b78d <__journal_remove_journal_head+45>:  je     0xffffffff8037b794 <__journal_remove_journal_head+52>
.......
.......
==============

The oops seems to be due to a NULL journal head while evaluating the J_ASSERT_JH() macro:

==============
static void __journal_remove_journal_head(struct buffer_head *bh)
{
        struct journal_head *jh = bh2jh(bh);

        J_ASSERT_JH(jh, jh->b_jcount >= 0);     <=== jh is NULL

        get_bh(bh);
        if (jh->b_jcount == 0) {
                if (jh->b_transaction == NULL &&
                ....
==============

Not sure why that would happen (corruption?).

A few system details:

================
- 64-bit, 2 quad-core (total 8 cores) Xeon, 48GB RAM
- Stock 2.6.30.1 kernel, *no* modules
- ext3 file-system (data=ordered mode) used over encrypted (dmcrypt) disks.
- underlying storage: h/w RAID.
- ext*/jbd config values:

  CONFIG_EXT3_FS=y
  CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
  CONFIG_EXT3_FS_XATTR=y
  # CONFIG_EXT3_FS_POSIX_ACL is not set
  # CONFIG_EXT3_FS_SECURITY is not set
  CONFIG_EXT4_FS=y
  # CONFIG_EXT4DEV_COMPAT is not set
  CONFIG_EXT4_FS_XATTR=y
  CONFIG_EXT4_FS_POSIX_ACL=y
  CONFIG_EXT4_FS_SECURITY=y
  CONFIG_JBD=y
  # CONFIG_JBD_DEBUG is not set
  CONFIG_JBD2=y
  # CONFIG_JBD2_DEBUG is not set
  CONFIG_FS_MBCACHE=y
  # CONFIG_REISERFS_FS is not set
  # CONFIG_JFS_FS is not set
  CONFIG_FS_POSIX_ACL=y
===================

Let me know if you need any more details. Reproducing this (or finding a good test to trigger this) is proving to be difficult :-( It sorta oopses once in a while ;-)

thanks
abhijit

ps: please Cc: me on the replies. I am not subscribed to either of the lists -- thanks!
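As a sanity check on the "jh is NULL" reading, the faulting address (CR2 = 0x8) and the 0x28 in the cmpq can be compared against the field offsets of struct journal_head. The sketch below is a userspace mock, not the kernel definition: the names mock_journal_head, jcount_fault_address and print_mock_layout are made up for illustration, the field order is an assumption based on the 2.6.30-era include/linux/journal-head.h, and the offsets only hold on an LP64 build (8-byte pointers, 4-byte int).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Opaque stand-ins; the real kernel types are not needed for offsets. */
struct mock_buffer_head;
struct mock_transaction;

/*
 * Hypothetical mirror of the leading fields of struct journal_head.
 * Field order is an assumption, not copied from the running kernel.
 */
struct mock_journal_head {
    struct mock_buffer_head *b_bh;          /* expected offset 0x00 */
    int b_jcount;                           /* expected offset 0x08 */
    unsigned b_jlist;                       /* expected offset 0x0c */
    unsigned b_modified;                    /* expected offset 0x10 */
    char *b_frozen_data;                    /* expected offset 0x18 */
    char *b_committed_data;                 /* expected offset 0x20 */
    struct mock_transaction *b_transaction; /* expected offset 0x28 */
};

/* Address the CPU would read when evaluating jh->b_jcount for a given jh. */
uintptr_t jcount_fault_address(uintptr_t jh)
{
    return jh + offsetof(struct mock_journal_head, b_jcount);
}

/* Print the two offsets that show up in the disassembly. */
void print_mock_layout(void)
{
    printf("b_jcount      at 0x%zx\n",
           offsetof(struct mock_journal_head, b_jcount));
    printf("b_transaction at 0x%zx\n",
           offsetof(struct mock_journal_head, b_transaction));
}
```

Under those assumptions jcount_fault_address(0) evaluates to 0x8, which lines up with CR2 and with RBX being 0 at "mov 0x8(%rbx),...", and b_transaction lands at 0x28, which lines up with "cmpq $0x0,0x28(%rbx)". That is consistent with bh->b_private (the load from 0x40(%rdi)) having been NULL, though it does not say why it was.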