On Fri, 2007-07-27 at 14:21 -0500, Steve Rigler wrote:
> Hello All,
>
> We are running GFS on RHEL4U3 (x86_64). One of our cluster nodes
> crashes this afternoon. We are able to capture some of the message from
> netdump (pasted below) before fencing killed the node.
>
> Any advice would be appreciated.
>
> Thanks,
> Steve

As a followup, this is past tense (the word "crashes" should have been
"crashed"). One of the other nodes panicked after the first one tried to
rejoin the cluster (this is a 3-node cluster). The dump from that node had
these messages near the beginning of its crash:

WARNING: dlm_emergency_shutdown
WARNING: dlm_emergency_shutdown
SM: 00000001 sm_stop: SG still joined
SM: 01000002 sm_stop: SG still joined
SM: 02000004 sm_stop: SG still joined
SM: 0300000d sm_stop: SG still joined

Followed by this:

lock_dlm: Assertion failed on line 428 of file /usr/src/build/714650-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 5442621324
STUL03E: num=1,2 err=-22 cur=-1 req=3 lkf=0

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:428
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfsd exportfs nfs lockd nfs_acl parport_pc lp parport netconsole netdump autofs4 i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket pcmcia_core dm_mirror dm_round_robin dm_multipath button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
Pid: 30604, comm: umount Not tainted 2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa02689e7>] <ffffffffa02689e7>{:lock_dlm:do_dlm_lock+365}
RSP: 0018:000001002ab6dc38  EFLAGS: 00010216
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000000246
RDX: 000000000000996e RSI: 0000000000000246 RDI: ffffffff803d9e60
RBP: 0000010117945c80 R08: 0000000000000004 R09: 00000000ffffffea
R10: 0000000000000000 R11: 00000000000000e4 R12: 00000100dfd23400
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000003
FS:  0000002a95575b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000
CR0: 000000008005003b CR2: 0000003f95fc60c0 CR3: 0000000000101000 CR4: 00000000000006e0
Process umount (pid: 30604, threadinfo 000001002ab6c000, task 00000101120da030)
Stack: 0000000000000003 0000000000000000 3120202020202020 2020202020202020
       3220202020202020 0000000000000018 0000010117945c80 0000000000000000
       0000000000000003 0000000000000000
Call Trace:
 <ffffffffa0268b2a>{:lock_dlm:lm_dlm_lock+214}
 <ffffffffa022f93f>{:gfs:gfs_lm_lock+50}
 <ffffffffa02269da>{:gfs:gfs_glock_xmote_th+357}
 <ffffffffa0224cdd>{:gfs:run_queue+667}
 <ffffffffa0225ccf>{:gfs:gfs_glock_nq+938}
 <ffffffffa0225f11>{:gfs:gfs_glock_nq_init+20}
 <ffffffffa024629b>{:gfs:gfs_make_fs_ro+39}
 <ffffffffa023e508>{:gfs:gfs_put_super+630}
 <ffffffff8017d0c9>{generic_shutdown_super+202}
 <ffffffffa023c009>{:gfs:gfs_kill_sb+42}
 <ffffffff801ccb78>{dummy_inode_permission+0}
 <ffffffff8017cfe6>{deactivate_super+95}
 <ffffffff80192537>{sys_umount+925}
 <ffffffff80180264>{sys_newstat+17}
 <ffffffff80110c61>{error_exit+0}
 <ffffffff801101c6>{system_call+126}

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster