Well, the problem has gotten even stranger. Now a node is mysteriously crashing with nothing in the logs:

Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 20260 state 0
Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 202e2 state 0
Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 303d7 state 0
Nov 1 04:02:19 http2 kernel: dlm: http: process_lockqueue_reply id 50159 state 0
Nov 1 06:29:19 http2 sshd(pam_unix)[24026]: session opened for user root by root(uid=0)
Nov 1 06:45:02 http2 syslogd 1.4.1: restart.
Nov 1 06:45:02 http2 syslog: syslogd startup succeeded

Earlier in the day, though, I had this crash on my GNBD server (it might not be related to my other problem, but hey, who knows). It looks like it's related to DLM:

Oct 31 10:35:55 storage1 gnbd_serv[5073]: server process 25402 exited because of signal 15
Oct 31 10:35:55 storage1 gnbd_serv[5073]: server process 25400 exited because of signal 15
Oct 31 10:39:45 storage1 kernel: rebuilt 1 resources
Oct 31 10:39:45 storage1 kernel: backups rebuilt 98 resources
Oct 31 10:39:45 storage1 kernel: clvmd purge requests
Oct 31 10:39:45 storage1 kernel: backups purge requests
Oct 31 10:39:45 storage1 kernel: clvmd purged 0 requests
Oct 31 10:39:45 storage1 kernel: backups purged 0 requests
Oct 31 10:39:45 storage1 kernel: configs mark waiting requests
Oct 31 10:39:45 storage1 kernel: configs marked 0 requests
Oct 31 10:39:45 storage1 kernel: configs purge locks of departed nodes
Oct 31 10:39:45 storage1 kernel: configs purged 11 locks
Oct 31 10:39:45 storage1 kernel: configs update remastered resources
Oct 31 10:39:45 storage1 kernel: configs updated 1 resources
Oct 31 10:39:45 storage1 kernel: configs rebuild locks
Oct 31 10:39:45 storage1 kernel: configs rebuilt 1 locks
Oct 31 10:39:45 storage1 kernel: configs recover event 230 done
Oct 31 10:39:45 storage1 kernel: configs move flags 0,0,1 ids 229,230,230
Oct 31 10:39:45 storage1 kernel: configs process held requests
Oct 31 10:39:45 storage1 kernel: configs processed 0 requests
Oct 31 10:39:45 storage1 kernel: configs resend marked requests
Oct 31 10:39:45 storage1 kernel: configs resent 0 requests
Oct 31 10:39:45 storage1 kernel: configs recover event 230 finished
Oct 31 10:39:45 storage1 kernel: clvmd mark waiting requests
Oct 31 10:39:45 storage1 kernel: clvmd marked 0 requests
Oct 31 10:39:46 storage1 kernel: clvmd purge locks of departed nodes
Oct 31 10:39:46 storage1 kernel: clvmd purged 5 locks
Oct 31 10:39:46 storage1 kernel: clvmd update remastered resources
Oct 31 10:39:46 storage1 kernel: clvmd updated 0 resources
Oct 31 10:39:46 storage1 kernel: clvmd rebuild locks
Oct 31 10:39:46 storage1 kernel: clvmd rebuilt 0 locks
Oct 31 10:39:46 storage1 kernel: clvmd recover event 230 done
Oct 31 10:39:46 storage1 kernel: Magma mark waiting requests
Oct 31 10:39:46 storage1 kernel: Magma marked 0 requests
Oct 31 10:39:46 storage1 kernel: Magma purge locks of departed nodes
Oct 31 10:39:46 storage1 kernel: Magma purged 0 locks
Oct 31 10:39:46 storage1 kernel: Magma update remastered resources
Oct 31 10:39:46 storage1 kernel: Magma updated 0 resources
Oct 31 10:39:46 storage1 kernel: Magma rebuild locks
Oct 31 10:39:46 storage1 kernel:
Oct 31 10:39:46 storage1 kernel: DLM: Assertion failed on line 105 of file /home/buildcentos/rpmbuild/BUILD/dlm-kernel-2.6.9-42/hugemem/src/rebuild.c
Oct 31 10:39:46 storage1 kernel: DLM: assertion: "root->res_newlkid_expect"
Oct 31 10:39:46 storage1 kernel: DLM: time = 2164169409
Oct 31 10:39:46 storage1 kernel: newlkid_expect=0
Oct 31 10:39:46 storage1 kernel:
Oct 31 10:39:46 storage1 kernel: ------------[ cut here ]------------
Oct 31 10:39:46 storage1 kernel: kernel BUG at /home/buildcentos/rpmbuild/BUILD/dlm-kernel-2.6.9-42/hugemem/src/rebuild.c:105!
Oct 31 10:39:46 storage1 kernel: invalid operand: 0000 [#1]
Oct 31 10:39:46 storage1 kernel: SMP
Oct 31 10:39:46 storage1 kernel: Modules linked in: ip_vs_wlc ip_vs lock_dlm(U) gfs(U) lock_harness(U) mptctl mptbase dell_rbu parport_pc lp parport autofs4 i2c_dev i2c_core gnbd(U) dlm(U) cman(U) sunrpc ipmi_devintf ipmi_si ipmi_msghandler iptable_filter ip_tables md5 ipv6 dm_mirror joydev button battery ac uhci_hcd ehci_hcd hw_random shpchp e1000 bonding(U) floppy sg ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 31 10:39:46 storage1 kernel: CPU: 0
Oct 31 10:39:46 storage1 kernel: EIP: 0060:[<f8a2cfcd>] Not tainted VLI
Oct 31 10:39:46 storage1 kernel: EFLAGS: 00010246 (2.6.9-42.0.2.ELhugemem)
Oct 31 10:39:46 storage1 kernel: EIP is at have_new_lkid+0x79/0xb7 [dlm]
Oct 31 10:39:46 storage1 kernel: eax: 00000001 ebx: dd76a0ec ecx: e1069e3c edx: f8a340dd
Oct 31 10:39:46 storage1 kernel: esi: dd76a150 edi: 009803dc ebp: 39f2e400 esp: e1069e38
Oct 31 10:39:46 storage1 kernel: ds: 007b es: 007b ss: 0068
Oct 31 10:39:46 storage1 kernel: Process dlm_recvd (pid: 4314, threadinfo=e1069000 task=e13c1630)
Oct 31 10:39:46 storage1 kernel: Stack: f8a340dd f8a34136 00000000 f8a34086 00000069 f8a3403b f8a3411d 80fe9ac1
Oct 31 10:39:46 storage1 kernel: 000002e8 00060028 f8a2e46b 6b914018 00000001 00000020 6b914000 39f2e400
Oct 31 10:39:46 storage1 kernel: 00000001 6b914000 f8a2e9f6 000002e8 00004040 00001000 de541580 00000001
Oct 31 10:39:46 storage1 kernel: Call Trace:
Oct 31 10:39:46 storage1 kernel: [<f8a2e46b>] rebuild_rsbs_lkids_recv+0x99/0x106 [dlm]
Oct 31 10:39:46 storage1 kernel: [<f8a2e9f6>] rcom_process_message+0x2e8/0x405 [dlm]
Oct 31 10:39:46 storage1 kernel: [<f8a2ecfd>] process_recovery_comm+0x3c/0xa7 [dlm]
Oct 31 10:39:46 storage1 kernel: [<f8a2ab8b>] midcomms_process_incoming_buffer+0x1bc/0x1f8 [dlm]
Oct 31 10:39:46 storage1 kernel: [<02142d40>] buffered_rmqueue+0x17d/0x1a5
Oct 31 10:39:46 storage1 kernel: [<021204e9>] autoremove_wake_function+0x0/0x2d
Oct 31 10:39:46 storage1 kernel: [<02142e1c>] __alloc_pages+0xb4/0x29d
Oct 31 10:39:46 storage1 kernel: [<f8a28e01>] receive_from_sock+0x192/0x26c [dlm]
Oct 31 10:39:46 storage1 kernel: [<f8a29cc9>] dlm_recvd+0x0/0x95 [dlm]
Oct 31 10:39:46 storage1 kernel: [<f8a29b73>] process_sockets+0x56/0x91 [dlm]
Oct 31 10:39:46 storage1 kernel: [<f8a29d4e>] dlm_recvd+0x85/0x95 [dlm]
Oct 31 10:39:46 storage1 kernel: [<02133089>] kthread+0x73/0x9b
Oct 31 10:39:46 storage1 kernel: [<02133016>] kthread+0x0/0x9b
Oct 31 10:39:46 storage1 kernel: [<021041f5>] kernel_thread_helper+0x5/0xb
Oct 31 10:39:46 storage1 kernel: Code: 41 a3 f8 68 3b 40 a3 f8 6a 69 68 86 40 a3 f8 e8 17 59 6f 09 ff 73 60 68 36 41 a3 f8 e8 0a 59 6f 09 68 dd 40 a3 f8 e8 00 59 6f 09 <0f> 0b 69 00 3b 40 a3 f8 83 c4 20 68 df 40 a3 f8 e8 55 50 6f 09
Oct 31 10:39:46 storage1 kernel: <0>Fatal exception: panic in 5 seconds

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster