On Mon, 2005-04-11 at 20:30, David Teigland wrote:
> On Mon, Apr 11, 2005 at 05:13:06PM -0700, Daniel McNeil wrote:
> > I started my mount/tar/rm/ tests on Apr 4 17:41 and I hit
> > a problem at Apr 6 05:30. So the test ran for 36 hours.
> > cl030 and cl031 were getting "SM: process_reply invalid"
> > messages and cl032 got "No response" and "Missed too many
> > heartbeats"
>
> The SM messages are an effect of CMAN removing nodes. There's a fair
> chance that this recent fix will help:
> http://sources.redhat.com/ml/cluster-cvs/2005-q2/msg00018.html

Good news and bad news.

Good news: I think my previous problem was a network upgrade that
accidentally cut off one of my nodes.

Bad news: after upgrading to the latest cvs, I hit an oops after 12 hours.
The oops below looks like we are accessing freed memory.
I have slab debug and spin lock debug configured.

Here's the oops:

Unable to handle kernel paging request at virtual address 6b6b6bbf
 printing eip:
c03e8682
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in: lock_dlm dlm gfs lock_harness cman qla2200 qla2xxx dm_mod video
CPU:    0
EIP:    0060:[<c03e8682>]    Not tainted VLI
EFLAGS: 00010246   (2.6.11)
EIP is at _spin_lock+0x22/0x90
eax: 00000000   ebx: 6b6b6bbf   ecx: 00000001   edx: cdc82000
esi: cdc82000   edi: 6b6b6bbf   ebp: cdc82ea4   esp: cdc82e9c
ds: 007b   es: 007b   ss: 0068
Process umount (pid: 14022, threadinfo=cdc82000 task=cc113a60)
Stack: d2bee958 d2beea7c cdc82ebc c0162f06 d2bee958 d2bee968 d2bee958 6b6b6b6b
       cdc82edc c017bb24 d2bee958 00004192 00000001 cdc82eec ce844050 f90314e0
       cdc82efc c017bc14 cbd665d0 cdc82eec d2bee4ec cbe47b3c cbd66544 ce844050
Call Trace:
 [<c01041ff>] show_stack+0x7f/0xa0
 [<c01043b2>] show_registers+0x162/0x1e0
 [<c01045de>] die+0xfe/0x190
 [<c0115892>] do_page_fault+0x3b2/0x6f2
 [<c0103e57>] error_code+0x2b/0x30
 [<c0162f06>] invalidate_inode_buffers+0x46/0x90
 [<c017bb24>] invalidate_list+0x44/0xe0
 [<c017bc14>] invalidate_inodes+0x54/0x90
 [<c0167974>] generic_shutdown_super+0x74/0x140
 [<f9010aee>] gfs_kill_sb+0x2e/0x69 [gfs]
 [<c0167821>] deactivate_super+0x81/0xa0
 [<c017ed5c>] sys_umount+0x3c/0xa0
 [<c017edd9>] sys_oldumount+0x19/0x20
 [<c010335d>] sysenter_past_esp+0x52/0x75
Code: 00 00 00 8d bf 00 00 00 00 55 89 e5 83 ec 08 89 1c 24 89 c3 b8 01 00 00 00 89 74 24 04 e8 47 06 d3 ff be 00 f0 ff ff 21 e6 31 c0 <86> 03 84 c0 7e 0b 8b 1c 24 8b 74 24 04 89 ec 5d c3 b8 01 00 00

Daniel

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster
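
One way to sanity-check the freed-memory reading of the oops above: with slab debugging enabled, the kernel fills freed objects with the byte 0x6b, so a pointer loaded from a freed object reads as 6b6b6b6b. The faulting address 6b6b6bbf is that value plus 0x54, which would be consistent with _spin_lock being handed the address of a lock at offset 0x54 inside a structure reached through a poisoned pointer. That offset interpretation is an assumption, not something the trace proves. The sketch below (Python, with a hypothetical helper name poison_offset and an arbitrary 4 KiB cutoff) just does the arithmetic for a 32-bit address:

```python
# Slab debug writes this byte over freed objects (POISON_FREE in the kernel).
POISON_FREE = 0x6B


def poison_offset(addr, max_offset=0x1000):
    """Return addr's offset from a 32-bit all-0x6b pointer, or None.

    Hypothetical helper: a small non-negative offset suggests the address
    was computed as (poisoned pointer + field offset). The max_offset
    cutoff is an arbitrary assumption, not a kernel constant.
    """
    base = int.from_bytes(bytes([POISON_FREE]) * 4, "big")  # 0x6b6b6b6b
    offset = addr - base
    return offset if 0 <= offset < max_offset else None


if __name__ == "__main__":
    fault_addr = 0x6B6B6BBF  # "virtual address" from the oops
    eip = 0xC03E8682         # ordinary kernel text address, for contrast
    print(poison_offset(fault_addr))  # 84, i.e. 0x54
    print(poison_offset(eip))         # None
```

The same check applies to the stack dump, where the word 6b6b6b6b also appears in the first row.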