thorsten.henrici@xxxxxx wrote:
>
> Hello,
> I'm experiencing the following reproducible problem:
>
> I have two nodes A and B. I reboot node B and get the following syslog
> on node B:
>
> Shutdown Node B - Syslog
> ----------------------------------------------------------------------
> May 20 13:27:24 sdhhdewer38b shutdown: shutting down for system reboot
> May 20 13:27:24 sdhhdewer38b init: Switching to runlevel: 6
> (...)
> May 20 13:27:47 sdhhdewer38b rgmanager: [24641]: <notice> Shutting down
> Cluster Service Manager...
> May 20 13:27:47 sdhhdewer38b clurgmgrd[31332]: <notice> Shutting down
> May 20 13:27:47 sdhhdewer38b clurgmgrd[31332]: <notice> Stopping service
> s_ndb_mgmd_ip
> May 20 13:27:47 sdhhdewer38b clurgmgrd: [31332]: <info> Removing IPv4
> address 10.112.24.20 from eth0
> May 20 13:27:57 sdhhdewer38b clurgmgrd[31332]: <notice> Service
> s_ndb_mgmd_ip is stopped
> May 20 13:27:59 sdhhdewer38b clurgmgrd[31332]: <notice> Shutdown
> complete, exiting
> May 20 13:28:00 sdhhdewer38b rgmanager: [24641]: <notice> Cluster
> Service Manager is stopped.
> (...)
> May 20 13:28:21 sdhhdewer38b fenced: Stopping fence domain:
> May 20 13:28:21 sdhhdewer38b fenced: shutdown succeeded
> May 20 13:28:21 sdhhdewer38b fenced:
> May 20 13:28:21 sdhhdewer38b fenced:
> May 20 13:28:21 sdhhdewer38b rc: Stopping fenced: succeeded
> May 20 13:28:21 sdhhdewer38b lock_gulmd: Stopping lock_gulmd:
> May 20 13:28:21 sdhhdewer38b lock_gulmd: shutdown succeeded
> May 20 13:28:21 sdhhdewer38b lock_gulmd: [
> May 20 13:28:21 sdhhdewer38b lock_gulmd:
> May 20 13:28:21 sdhhdewer38b rc: Stopping lock_gulmd: succeeded
> May 20 13:28:21 sdhhdewer38b cman: Stopping cman:
> May 20 13:28:24 sdhhdewer38b cman: failed to stop cman failed
> May 20 13:28:24 sdhhdewer38b cman: [
> May 20 13:28:24 sdhhdewer38b cman:
> May 20 13:28:24 sdhhdewer38b rc: Stopping cman: failed
> May 20 13:28:24 sdhhdewer38b ccsd: Stopping ccsd:
> May 20 13:28:24 sdhhdewer38b ccsd[2564]: Stopping ccsd, SIGTERM received.
> May 20 13:28:25 sdhhdewer38b ccsd: shutdown succeeded
> May 20 13:28:25 sdhhdewer38b ccsd:
> May 20 13:28:25 sdhhdewer38b ccsd:
> May 20 13:28:25 sdhhdewer38b rc: Stopping ccsd: succeeded
> ----------------------------------------------------------
>
> Rebooting Node B causes a kernel panic on Node B while the cman
> service is starting (loading the cman module).
> That is bad enough, but it gets worse: Node A leaves the
> cluster, which brings all services running on it to a halt.
>
> I assume that this behavior won't occur if I manually remove node B
> from the cluster before rebooting. (I haven't tested this yet, but will
> as soon as I have the chance.)
> Nevertheless, I think this behavior is far too risky to have in
> a production environment. Is this already known, and is there a safe
> way to fix it?
>
> Syslogs of Node A and B during reboot:
>
> I'm running a self-written daemon process that checks
> /proc/cluster/status for the node's membership state.
> If the state has not been 'Member' for more than 60 seconds, I halt the
> system to get into a consistent state. Call it a kind of self-fencing.
>
> Startup Node B - Syslog
> --------------------------------------------------------------------------------------
>
> May 20 13:34:19 sdhhdewer38b ccsd[2633]: Connected to cluster
> infrastruture via: CMAN/SM Plugin v1.1.5
> May 20 13:34:19 sdhhdewer38b ccsd[2633]: Initial status:: Inquorate
> May 20 13:34:19 sdhhdewer38b kernel: CMAN: sending membership request
> May 20 13:34:19 sdhhdewer38b kernel: CMAN: sending membership request
> May 20 13:34:19 sdhhdewer38b kernel: CMAN: got node sdhhdewer38a
> May 20 13:36:13 sdhhdewer38b kernel: CMAN: removing node sdhhdewer38a
> from the cluster : No response to messages
> May 20 13:36:13 sdhhdewer38b kernel: ------------[ cut here ]------------
> May 20 13:36:13 sdhhdewer38b kernel: kernel BUG at
> /usr/src/build/714635-i686/BUILD/cman-kernel-2.6.9-43/smp/src/membership.c:3150!
>
> May 20 13:36:13 sdhhdewer38b kernel: invalid operand: 0000 [#1]
> May 20 13:36:13 sdhhdewer38b kernel: SMP
> May 20 13:36:13 sdhhdewer38b kernel: Modules linked in: cman(U) sunrpc
> md5 ipv6 dm_multipath button battery ac uhci_hcd ehci_hcd hw_random
> bcm5700(U) floppy dm_snapshot dm_zero d
> m_mirror ext3 jbd dm_mod qla6312(U) qla2400(U) qla2300(U) qla2xxx(U)
> qla2xxx_conf(U) cciss sd_mod scsi_mod
> May 20 13:36:13 sdhhdewer38b kernel: CPU: 2
> May 20 13:36:13 sdhhdewer38b kernel: EIP: 0060:[<f8ae1e2a>] Not
> tainted VLI
> May 20 13:36:13 sdhhdewer38b kernel: EFLAGS: 00010246 (2.6.9-34.ELsmp)
> May 20 13:36:13 sdhhdewer38b kernel: EIP is at elect_master+0x2e/0x3a
> [cman]
> May 20 13:36:13 sdhhdewer38b kernel: eax: 00000000 ebx: f77c7fa0
> ecx: 00000080 edx: 00000080
> May 20 13:36:13 sdhhdewer38b kernel: esi: f8af5044 edi: f77c7fd8
> ebp: 00000000 esp: f77c7f98
> May 20 13:36:13 sdhhdewer38b kernel: ds: 007b es: 007b ss: 0068
> May 20 13:36:13 sdhhdewer38b kernel: Process
> cman_memb (pid: 2658,
> threadinfo=f77c7000 task=f7638730)
> May 20 13:36:13 sdhhdewer38b kernel: Stack: f8af4f08 f8adf8d1 c364eb00
> f6b23320 f8addeb7 f7638730 f7638730 f8ade09a
> May 20 13:36:13 sdhhdewer38b kernel: 0000001f 00000000 f705e6b0
> 00000000 f7638730 c011e71b 00100100 00200200
> May 20 13:36:13 sdhhdewer38b kernel: 00000000 00000000 0000007b
> f8added8 00000000 00000000 c01041f5 00000000
> May 20 13:36:13 sdhhdewer38b kernel: Call Trace:
> May 20 13:36:13 sdhhdewer38b kernel: [<f8adf8d1>]
> a_node_just_died+0x13a/0x199 [cman]
> May 20 13:36:13 sdhhdewer38b kernel: [<f8addeb7>]
> process_dead_nodes+0x4e/0x6f [cman]
> May 20 13:36:13 sdhhdewer38b kernel: [<f8ade09a>]
> membership_kthread+0x1c2/0x39d [cman]
> May 20 13:36:13 sdhhdewer38b kernel: [<c011e71b>]
> default_wake_function+0x0/0xc
> May 20 13:36:13 sdhhdewer38b kernel: [<f8added8>]
> membership_kthread+0x0/0x39d [cman]
> May 20 13:36:13 sdhhdewer38b kernel: [<c01041f5>]
> kernel_thread_helper+0x5/0xb
> May 20 13:36:13 sdhhdewer38b kernel: Code: 28 5e af f8 89 c3 ba 01 00 00
> 00 39 ca 7d 1c a1 2c 5e af f8 8b 04 90 85 c0 74 0d 83 78 1c 02 75 07 89
> 03 8b 40 14 eb 0d 42 eb e0 <0f> 0
> b 4e 0c 73 8d ae f8 31 c0 5b c3 a1 2c 5e af f8 e8 79 10 67
> May 20 13:36:13 sdhhdewer38b kernel: <0>Fatal exception: panic in 5
> seconds
> May 20 13:36:18 sdhhdewer38b cman: Timed-out waiting for cluster failed
> May 20 13:36:18 sdhhdewer38b lock_gulmd: no <gulm> section detected in
> /etc/cluster/cluster.conf succeeded
> May 20 13:39:26 sdhhdewer38b syslogd 1.4.1: restart.
> --------------------------------------------------------------------------------------
>
>
> Node A - Syslog
> --------------------------------------------------------------------------------------
>
> May 20 13:27:57 sdhhdewer38a clurgmgrd[31341]: <info> Magma Event:
> Membership Change
> May 20 13:27:57 sdhhdewer38a clurgmgrd[31341]: <info> State change:
> sdhhdewer38b DOWN
> May 20 13:28:00 sdhhdewer38a clurgmgrd[31341]: <notice> Starting stopped
> service s_ndb_mgmd_ip
> May 20 13:28:00 sdhhdewer38a clurgmgrd: [31341]: <info> Adding IPv4
> address 10.112.24.20 to eth0
> May 20 13:28:01 sdhhdewer38a clurgmgrd[31341]: <notice> Service
> s_ndb_mgmd_ip started
> May 20 13:34:19 sdhhdewer38a kernel: CMAN: node sdhhdewer38b rejoining
> May 20 13:34:27 sdhhdewer38a logger: Node is not a member of the
> Cluster. Membership state: Transition-Master
> May 20 13:34:27 sdhhdewer38a logger: Node will be shut down in 50 seconds
> May 20 13:34:37 sdhhdewer38a logger: Node is not a member of the
> Cluster. Membership state: Transition-Master
> May 20 13:34:37 sdhhdewer38a logger: Node will be shut down in 40 seconds
> May 20 13:34:40 sdhhdewer38a clurgmgrd: [31341]: <info> Executing
> /etc/init.d/httpd status
> May 20 13:34:47 sdhhdewer38a logger: Node is not a member of the
> Cluster. Membership state: Transition-Master
> May 20 13:34:47 sdhhdewer38a logger: Node will be shut down in 30 seconds
> May 20 13:34:57 sdhhdewer38a logger: Node is not a member of the
> Cluster. Membership state: Transition-Master
> May 20 13:34:57 sdhhdewer38a logger: Node will be shut down in 20 seconds
> May 20 13:35:07 sdhhdewer38a logger: Node is not a member of the
> Cluster. Membership state: Transition-Master
> May 20 13:35:07 sdhhdewer38a logger: Node will be shut down in 10 seconds
> May 20 13:35:11 sdhhdewer38a clurgmgrd: [31341]: <info> Executing
> /etc/init.d/httpd status
> May 20 13:35:18 sdhhdewer38a logger: Node is not a member of the
> Cluster.
> Membership state: Transition-Master
> May 20 13:35:18 sdhhdewer38a logger: Node will be shut down in 0 seconds
> May 20 13:35:18 sdhhdewer38a logger: sdhhdewer38a is currently not a
> cluster member. Shutting down to get into a consistent state !
> May 20 13:35:18 sdhhdewer38a logger: Killing the following processes
> before shutdown: 31341 2742 2745 2743 2744
> May 20 13:35:18 sdhhdewer38a shutdown: shutting down for system reboot
> --------------------------------------------------------------------------------------
>

This is in bugzilla:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=187777

A fix is in CVS.

--
patrick

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
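
For readers following the thread, the self-fencing check described in the quoted message (poll /proc/cluster/status, halt once the node has not been a member for 60 seconds) could be sketched roughly as below. This is not the original poster's daemon, which was not posted. The "Membership state:" line format is inferred from the logger output in the Node A syslog, and the name of the healthy state, the class name, and the function names are all hypothetical.

```python
import re
from typing import Optional

# Assumption: the status text contains a line "Membership state: <state>",
# as suggested by the logger messages quoted in the thread.
STATE_RE = re.compile(r"^Membership state:\s*(\S+)", re.MULTILINE)


def membership_state(status_text: str) -> Optional[str]:
    """Extract the membership state from the status text, or None if absent."""
    match = STATE_RE.search(status_text)
    return match.group(1) if match else None


class MembershipWatchdog:
    """Tracks how long the node has continuously been in a non-member state."""

    def __init__(self, member_state: str, grace_seconds: float) -> None:
        # member_state is whatever string the cluster reports for a healthy
        # member; the caller supplies it because it is not shown in the thread.
        self.member_state = member_state
        self.grace_seconds = grace_seconds
        self._not_member_since: Optional[float] = None

    def observe(self, status_text: str, now: float) -> float:
        """Record one poll; return the seconds left before the node should halt.

        A return value of 0.0 means the grace period is used up.
        """
        if membership_state(status_text) == self.member_state:
            # Healthy again: reset the countdown.
            self._not_member_since = None
            return self.grace_seconds
        if self._not_member_since is None:
            # First poll in a non-member (or unreadable) state starts the clock.
            self._not_member_since = now
        elapsed = now - self._not_member_since
        return max(0.0, self.grace_seconds - elapsed)
```

A caller would read /proc/cluster/status every few seconds, pass the text and a monotonic timestamp to observe(), log the remaining time, and trigger the shutdown when it returns 0.0. Note that the Node A syslog above shows exactly this kind of countdown firing while the node sat in Transition-Master, so a transient transition state is treated the same as a real loss of membership in this sketch.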