Every few days or so, our cluster machines kernel panic, complaining about GFS locking (although it's pretty irregular; we once went a few weeks without an outage). We noticed that this happened a LOT, and was reproducible when certain users accessed files, while we were serving afp off the cluster. We have since changed things so that afp runs on a server which NFS-mounts the cluster.

We are running FC4 with the gfs modules from yum.

Here are our most recent kernel panics, followed by one from when we had afp running on the cluster. (It looks like there is info above the cut-here line that might be relevant, if it would be helpful.)

Panic 1:

Oct 19 14:44:41 meow kernel: ------------[ cut here ]------------
Oct 19 14:44:41 meow kernel: kernel BUG at /usr/src/build/607755-i686/BUILD/smp/src/lockqueue.c:1144!
Oct 19 14:44:41 meow kernel: invalid operand: 0000 [#1]
Oct 19 14:44:41 meow kernel: SMP
Oct 19 14:44:41 meow kernel: Modules linked in: nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) md5 ipv6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 19 14:44:41 meow kernel: CPU: 1
Oct 19 14:44:41 meow kernel: EIP: 0060:[<f8af9dcf>] Not tainted VLI
Oct 19 14:44:41 meow kernel: EFLAGS: 00010292 (2.6.12-1.1447_FC4smp)
Oct 19 14:44:41 meow kernel: EIP is at process_cluster_request+0xddb/0xdef [dlm]
Oct 19 14:44:41 meow kernel: eax: 00000004 ebx: 00000000 ecx: c035fa4c edx: 00000286
Oct 19 14:44:41 meow kernel: esi: f7fb8400 edi: 00000000 ebp: d2988000 esp: f7eefe24
Oct 19 14:44:41 meow kernel: ds: 007b es: 007b ss: 0068
Oct 19 14:44:41 meow kernel: Process dlm_recvd (pid: 2402, threadinfo=f7eef000 task=f7851020)
Oct 19 14:44:41 meow kernel: Stack: f8b0621b 00000001 f8b071e0 f8b06217 2583f987 00000001 00000040 00004000
Oct 19 14:44:41 meow kernel: f7eefe48 00000000 c038e1a0 00000a58 f0167b00 c02a26c1 00000a58 00004040
Oct 19 14:44:41 meow kernel: 00000072 f7eefed4 00000000 00000001 00000246 00000000 edd6eeb8 00000000
Oct 19 14:44:41 meow kernel: Call Trace:
Oct 19 14:44:41 meow kernel: [<c02a26c1>] sock_recvmsg+0x103/0x11e
Oct 19 14:44:41 meow kernel: [<f8afd46b>] midcomms_process_incoming_buffer+0x13b/0x25f [dlm]
Oct 19 14:44:41 meow kernel: [<c011ce54>] load_balance_newidle+0x23/0x82
Oct 19 14:44:41 meow kernel: [<f8afb3d3>] receive_from_sock+0x196/0x2c9 [dlm]
Oct 19 14:44:41 meow kernel: [<c0307705>] schedule+0x405/0xc5e
Oct 19 14:44:41 meow kernel: [<c0307731>] schedule+0x431/0xc5e
Oct 19 14:44:41 meow kernel: [<f8afc457>] dlm_recvd+0x0/0x9c [dlm]
Oct 19 14:44:41 meow kernel: [<f8afc2d3>] process_sockets+0x75/0xb7 [dlm]
Oct 19 14:44:41 meow kernel: [<f8afc4c7>] dlm_recvd+0x70/0x9c [dlm]
Oct 19 14:44:41 meow kernel: [<c0134c09>] kthread+0x93/0x97
Oct 19 14:44:41 meow kernel: [<c0134b76>] kthread+0x0/0x97
Oct 19 14:44:41 meow kernel: [<c01023d1>] kernel_thread_helper+0x5/0xb
Oct 19 14:44:41 meow kernel: Code: 4f 82 62 c7 89 e8 e8 b1 b4 00 00 8b 4c 24 14 89 4c 24 04 c7 04 24 6d 63 b0 f8 e8 34 82 62 c7 c7 04 24 1b 62 b0 f8 e8 28 82 62 c7 <0f> 0b 78 04 e0 71 b0 f8 c7 04 24 70 72 b0 f8 e8 40 78 62 c7 57
Oct 19 14:44:41 meow kernel: <0>Fatal exception: panic in 5 seconds

Panic 2:

Oct 10 09:58:39 woof kernel: ------------[ cut here ]------------
Oct 10 09:58:39 woof kernel: kernel BUG at /usr/src/build/607778-i686/BUILD/smp/src/dlm/lock.c:411!
Oct 10 09:58:39 woof kernel: invalid operand: 0000 [#1]
Oct 10 09:58:39 woof kernel: SMP
Oct 10 09:58:39 woof kernel: Modules linked in: nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) md5 ipv6 sunrpc ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 dm_snapshot dm_zero dm_mirror ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod
Oct 10 09:58:39 woof kernel: CPU: 1
Oct 10 09:58:39 woof kernel: EIP: 0060:[<f8b98bf5>] Not tainted VLI
Oct 10 09:58:39 woof kernel: EFLAGS: 00010292 (2.6.12-1.1447_FC4smp)
Oct 10 09:58:39 woof kernel: EIP is at do_dlm_lock+0x1b7/0x21d [lock_dlm]
Oct 10 09:58:39 woof kernel: eax: 00000004 ebx: 00000000 ecx: c035fa4c edx: 00000292
Oct 10 09:58:39 woof kernel: esi: f7848140 edi: ffffffea ebp: 00000003 esp: c74b3cfc
Oct 10 09:58:39 woof kernel: ds: 007b es: 007b ss: 0068
Oct 10 09:58:39 woof kernel: Process imapd (pid: 24278, threadinfo=c74b3000 task=f4721a80)
Oct 10 09:58:39 woof kernel: Stack: f8b9de75 f7848140 00000003 1bbe0000 00000000 ffffffea 00000003 00000005
Oct 10 09:58:39 woof kernel: 0000000d 00000005 00000000 f58c0a00 00000001 0000000d 20200000 20202020
Oct 10 09:58:39 woof kernel: 20203320 20202020 62312020 30306562 00183030 c8fb2f00 00000001 00000001
Oct 10 09:58:39 woof kernel: Call Trace:
Oct 10 09:58:39 woof kernel: [<f8b98cff>] lm_dlm_lock+0x52/0x5e [lock_dlm]
Oct 10 09:58:39 woof kernel: [<f8b98cad>] lm_dlm_lock+0x0/0x5e [lock_dlm]
Oct 10 09:58:39 woof kernel: [<f8bd000c>] gfs_lm_lock+0x3d/0x5c [gfs]
Oct 10 09:58:39 woof kernel: [<f8bc5039>] gfs_glock_xmote_th+0xae/0x1d3 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bc463c>] rq_promote+0x126/0x150 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bc4840>] run_queue+0xee/0x113 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bc5af1>] gfs_glock_nq+0x93/0x144 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bc619d>] gfs_glock_nq_init+0x18/0x2d [gfs]
Oct 10 09:58:39 woof kernel: [<f8be3926>] get_local_rgrp+0xca/0x1b0 [gfs]
Oct 10 09:58:39 woof kernel: [<f8be3a9c>] gfs_inplace_reserve_i+0x90/0xd0 [gfs]
Oct 10 09:58:39 woof kernel: [<f8be046b>] gfs_quota_lock_m+0xbf/0x117 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bd8a2b>] do_do_write_buf+0x3a1/0x485 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bc56a1>] glock_wait_internal+0x16b/0x26a [gfs]
Oct 10 09:58:39 woof kernel: [<f8bd8c91>] do_write_buf+0x182/0x1b6 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bd7be5>] walk_vm+0xb3/0x111 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bd8d65>] gfs_write+0xa0/0xc2 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bd8b0f>] do_write_buf+0x0/0x1b6 [gfs]
Oct 10 09:58:39 woof kernel: [<f8bd8cc5>] gfs_write+0x0/0xc2 [gfs]
Oct 10 09:58:39 woof kernel: [<c0162987>] vfs_write+0x9e/0x110
Oct 10 09:58:39 woof kernel: [<c0162aa4>] sys_write+0x41/0x6a
Oct 10 09:58:39 woof kernel: [<c0104035>] syscall_call+0x7/0xb
Oct 10 09:58:39 woof kernel: Code: 7c 24 14 89 4c 24 0c 89 5c 24 10 89 6c 24 08 89 74 24 04 c7 04 24 28 e6 b9 f8 e8 0e 94 58 c7 c7 04 24 75 de b9 f8 e8 02 94 58 c7 <0f> 0b 9b 01 a0 e4 b9 f8 c7 04 24 3c e5 b9 f8 e8 1a 8a 58 c7 66
Oct 10 09:58:39 woof kernel: <0>Fatal exception: panic in 5 seconds

Panic 3 (from when afp was running on the cluster):

Sep 7 15:37:44 meow kernel: ------------[ cut here ]------------
Sep 7 15:37:44 meow kernel: kernel BUG at /usr/src/build/588748-i686/BUILD/smp/src/dlm/plock.c:500!
Sep 7 15:37:44 meow kernel: invalid operand: 0000 [#1]
Sep 7 15:37:44 meow kernel: SMP
Sep 7 15:37:44 meow kernel: Modules linked in: appletalk nfsd exportfs lockd autofs4 lock_dlm(U) gfs(U) lock_harness(U) rfcomm l2cap bluetooth dlm(U) cman(U) sunrpc md5 ipv6 ipt_LOG ipt_limit ipt_state ip_conntrack iptable_filter ip_tables video button battery ac uhci_hcd ehci_hcd hw_random i2c_i801 i2c_core shpchp e1000 floppy ext3 jbd raid1 dm_mod qla2200 qla2xxx scsi_transport_fc ata_piix libata sd_mod scsi_mod
Sep 7 15:37:44 meow kernel: CPU: 3
Sep 7 15:37:44 meow kernel: EIP: 0060:[<f8b9a3f7>] Tainted: GF VLI
Sep 7 15:37:44 meow kernel: EFLAGS: 00010292 (2.6.12-1.1398_FC4smp)
Sep 7 15:37:44 meow kernel: EIP is at update_lock+0x87/0x9b [lock_dlm]
Sep 7 15:37:44 meow kernel: eax: 00000004 ebx: fffffff5 ecx: c035ca4c edx: 00000282
Sep 7 15:37:44 meow kernel: esi: 00000000 edi: e99c2c00 ebp: 00000000 esp: d05dedb4
Sep 7 15:37:44 meow kernel: ds: 007b es: 007b ss: 0068
Sep 7 15:37:44 meow kernel: Process afpd (pid: 3872, threadinfo=d05de000 task=d6447550)
Sep 7 15:37:44 meow kernel: Stack: badc0ded f8b9d0d6 fffffff5 f8b9da70 f8b9d101 06609291 f7943000 00000000
Sep 7 15:37:44 meow kernel: f8b9a499 7ffffff8 00000000 7ffffff8 00000000 d05dede8 d7636700 7ffffff8
Sep 7 15:37:44 meow kernel: 00000000 d05deea8 d05dee28 f8b9a987 00000001 7ffffff8 00000000 7ffffff8
Sep 7 15:37:44 meow kernel: Call Trace:
Sep 7 15:37:44 meow kernel: [<f8b9a499>] add_lock+0x8e/0xed [lock_dlm]
Sep 7 15:37:44 meow kernel: [<f8b9a987>] fill_gaps+0x87/0x10e [lock_dlm]
Sep 7 15:37:44 meow kernel: [<f8b9aa51>] lock_case3+0x43/0xac [lock_dlm]
Sep 7 15:37:44 meow kernel: [<f8b9aeac>] plock_internal+0x1aa/0x370 [lock_dlm]
Sep 7 15:37:44 meow kernel: [<f8b9b614>] lm_dlm_plock+0x25b/0x2dc [lock_dlm]
Sep 7 15:37:44 meow kernel: [<f8b9b3b9>] lm_dlm_plock+0x0/0x2dc [lock_dlm]
Sep 7 15:37:44 meow kernel: [<f8bdc1c3>] gfs_lm_plock+0x45/0x57 [gfs]
Sep 7 15:37:44 meow kernel: [<f8be5731>] gfs_lock+0xcd/0x11c [gfs]
Sep 7 15:37:44 meow kernel: [<f8be5664>] gfs_lock+0x0/0x11c [gfs]
Sep 7 15:37:44 meow kernel: [<c0176c4f>] fcntl_setlk64+0x16c/0x26a
Sep 7 15:37:44 meow kernel: [<c0162e93>] fget+0x3b/0x42
Sep 7 15:37:44 meow kernel: [<c0172bfd>] sys_fcntl64+0x55/0x97
Sep 7 15:37:44 meow kernel: [<c0104025>] syscall_call+0x7/0xb
Sep 7 15:37:44 meow kernel: Code: 01 00 00 c7 04 24 a8 da b9 f8 e8 7c 77 58 c7 89 5c 24 04 c7 04 24 08 d1 b9 f8 e8 6c 77 58 c7 c7 04 24 d6 d0 b9 f8 e8 60 77 58 c7 <0f> 0b f4 01 70 da b9 f8 c7 04 24 10 db b9 f8 e8 78 6d 58 c7 55
Sep 7 15:37:44 meow kernel: <0>Fatal exception: panic in 5 seconds

Thanks for any help,
Ethan

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster