Hi!

I've come across a problem with a two-node cluster on RHEL 4U3. When I attempt to reboot one of the nodes, it sometimes fails to leave the cluster correctly.

Before the reboot, both nodes are cluster members and it is possible to fail over services from one node to the other. When I try to reboot node1 (active at that time), services fail over to node2; however, cman fails to stop correctly:

  cman: Stopping cman:
  cman: failed to stop cman failed

node2 logs the following message:

  kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats

I see no information about fencing attempts in the log.

After node1's reboot, it is not able to rejoin the cluster any more.

node1:

  kernel: CMAN: Waiting to join or form a Linux-cluster
  kernel: CMAN: sending membership request
  kernel: CMAN: got node node2
  cman: Timed-out waiting for cluster failed

While on node2:

  kernel: CMAN: node node1 rejoining

and after ~4.5 minutes:

  kernel: CMAN: too many transition restarts - will die
  kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
  kernel: WARNING: dlm_emergency_shutdown
  clurgmgrd[2848]: <warning> #67: Shutting down uncleanly
  kernel: WARNING: dlm_emergency_shutdown
  kernel: SM: 00000001 sm_stop: SG still joined
  kernel: SM: 01000003 sm_stop: SG still joined
  kernel: SM: 03000002 sm_stop: SG still joined
  ccsd[2242]: Cluster is not quorate. Refusing connection.
  ccsd[2242]: Error while processing connect: Connection refused
  ccsd[2242]: Invalid descriptor specified (-111).
  ccsd[2242]: Someone may be attempting something evil.
  ccsd[2242]: Error while processing get: Invalid request descriptor
  ccsd[2242]: Invalid descriptor specified (-111).
  ccsd[2242]: Someone may be attempting something evil.
  ccsd[2242]: Error while processing get: Invalid request descriptor
  ccsd[2242]: Invalid descriptor specified (-21).
and again ~1 minute later on node1:

  kernel: CMAN: removing node node2 from the cluster : No response to messages
  kernel: ------------[ cut here ]------------
  kernel: kernel BUG at /usr/src/build/714635-i686/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150!
  kernel: invalid operand: 0000 [#1]
  kernel: SMP
  kernel: Modules linked in: cman(U) md5 ipv6 iptable_filter ip_tables button battery ac uhci_hcd ehci_hcd hw_random tg3 floppy sg st mptspi mptscsi mptbase dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod cciss sd_mod scsi_mod
  kernel: CPU: 0
  kernel: EIP: 0060:[<f8bebe2a>] Not tainted VLI
  kernel: EFLAGS: 00010246 (2.6.9-34.ELsmp)
  kernel: EIP is at elect_master+0x2e/0x3a [cman]
  kernel: eax: 00000000 ebx: f7b4afa0 ecx: 00000080 edx: 00000080
  kernel: esi: f8bff044 edi: f7b4afd8 ebp: 00000000 esp: f7b4af98
  kernel: ds: 007b es: 007b ss: 0068
  kernel: Process cman_memb (pid: 2429, threadinfo=f7b4a000 task=c1a33230)
  kernel: Stack: f8bfef08 f8be98d1 c1a7c580 f6e8ee00 f8be7eb7 c1a33230 c1a33230 f8be809a
  kernel: 0000001f 00000000 f7b460b0 00000000 c1a33230 c011e71b 00100100 00200200
  kernel: 00000000 00000000 0000007b f8be7ed8 00000000 00000000 c01041f5 00000000
  kernel: Call Trace:
  kernel: [<f8be98d1>] a_node_just_died+0x13a/0x199 [cman]
  kernel: [<f8be7eb7>] process_dead_nodes+0x4e/0x6f [cman]
  kernel: [<f8be809a>] membership_kthread+0x1c2/0x39d [cman]
  kernel: [<c011e71b>] default_wake_function+0x0/0xc
  kernel: [<f8be7ed8>] membership_kthread+0x0/0x39d [cman]
  kernel: [<c01041f5>] kernel_thread_helper+0x5/0xb
  kernel: Code: 28 fe bf f8 89 c3 ba 01 00 00 00 39 ca 7d 1c a1 2c fe bf f8 8b 04 90 85 c0 74 0d 83 78 1c 02 75 07 89 03 8b 40 14 eb 0d 42 eb e0 <0f> 0b 4e 0c 73 2d bf f8 31 c0 5b c3 a1 2c fe bf f8 e8 79 70 56
  kernel: <0>Fatal exception: panic in 5 seconds

During another test, the cluster did not crash; it just ended up in a state where cman on the rebooted node kept sending cluster membership requests and those requests were ignored by the other cluster node.
Output of tcpdump showed that the traffic was reaching the active node, but there was no reply, nor any message in the logs of the active node. The only way to get back to a normal state is to restart cman on the active node (or reboot both nodes).

If I try to reboot one of the cluster nodes shortly after rebooting both nodes, it seems to leave and rejoin the cluster successfully.

Has anyone observed similar behavior? Is this a known bug in U3 that can be resolved by upgrading to the latest version? I've checked the changelogs and release notes (Btw, any chance of getting back to the "old" release notes format for RHCS? The release notes for U5 no longer list fixed bugzilla reports; they only link to some errata listings, which do not seem to be accessible from the Internet.), but haven't found any obvious reference to this kind of problem.

Ideas appreciated.

th.