Hi all,

I configured GFS over DRBD (active-active) with RHCS and IPMI as the fence device. When I try to mount my GFS resource, my interconnect interface goes down and one node is fenced. This happens every time.

DRBD connects and becomes primary:

Jun 18 19:04:30 alice kernel: drbd0: Handshake successful: Agreed network protocol version 89
Jun 18 19:04:30 alice kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 18 19:04:30 alice kernel: drbd0: conn( WFConnection -> WFReportParams )
Jun 18 19:04:30 alice kernel: drbd0: Starting asender thread (from drbd0_receiver [3315])
Jun 18 19:04:30 alice kernel: drbd0: data-integrity-alg: <not-used>
Jun 18 19:04:30 alice kernel: drbd0: drbd_sync_handshake:
Jun 18 19:04:30 alice kernel: drbd0: self 2BA45318C0A122D1:CBAA0E591815072F:3F39591B4EF90EDD:2E40DDEB552666B9
Jun 18 19:04:30 alice kernel: drbd0: peer CBAA0E591815072E:0000000000000000:3F39591B4EF90EDD:2E40DDEB552666B9
Jun 18 19:04:30 alice kernel: drbd0: uuid_compare()=1 by rule 7
Jun 18 19:04:30 alice kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jun 18 19:04:30 alice kernel: drbd0: peer( Secondary -> Primary )
Jun 18 19:04:31 alice kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Jun 18 19:04:31 alice kernel: drbd0: Began resync as SyncSource (will sync 16384 KB [4096 bits set]).
Jun 18 19:04:33 alice kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 16384 K/sec)
Jun 18 19:04:33 alice kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )

Then the fence domain comes up fine:

Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering GATHER state from 11.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Creating commit token because I am the rep.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Saving state aru 1b high seq received 1b
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Storing new sequence id for ring 34
Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering COMMIT state.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] entering RECOVERY state.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116:
Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep 10.17.44.116
Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru 1b high delivered 1b received flag 1
Jun 18 19:04:35 alice openais[3475]: [TOTEM] position [1] member 10.17.44.117:
Jun 18 19:04:35 alice openais[3475]: [TOTEM] previous ring seq 48 rep 10.17.44.117
Jun 18 19:04:35 alice openais[3475]: [TOTEM] aru a high delivered a received flag 1
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Did not need to originate any messages in recovery.
Jun 18 19:04:35 alice openais[3475]: [TOTEM] Sending initial ORF token
Jun 18 19:04:35 alice openais[3475]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 18 19:04:36 alice openais[3475]: [CLM  ] New Configuration:
Jun 18 19:04:36 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.116)
Jun 18 19:04:36 alice openais[3475]: [CLM  ] Members Left:
Jun 18 19:04:36 alice openais[3475]: [CLM  ] Members Joined:
Jun 18 19:04:36 alice openais[3475]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 18 19:04:36 alice openais[3475]: [CLM  ] New Configuration:
Jun 18 19:04:36 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.116)
Jun 18 19:04:36 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.117)
Jun 18 19:04:36 alice openais[3475]: [CLM  ] Members Left:
Jun 18 19:04:36 alice openais[3475]: [CLM  ] Members Joined:
Jun 18 19:04:36 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.117)
Jun 18 19:04:36 alice openais[3475]: [SYNC ] This node is within the primary component and will provide service.
Jun 18 19:04:36 alice openais[3475]: [TOTEM] entering OPERATIONAL state.
Jun 18 19:04:36 alice openais[3475]: [CLM  ] got nodejoin message 10.17.44.116
Jun 18 19:04:36 alice openais[3475]: [CLM  ] got nodejoin message 10.17.44.117
Jun 18 19:04:36 alice openais[3475]: [CPG  ] got joinlist message from node 1
Jun 18 19:04:40 alice kernel: dlm: connecting to 2
Jun 18 19:04:40 alice kernel: dlm: got connection from 2

Why does the link go down here?

Jun 18 19:04:53 alice kernel: eth2: Link is Down
Jun 18 19:04:53 alice openais[3475]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun 18 19:04:53 alice openais[3475]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jun 18 19:04:53 alice openais[3475]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun 18 19:04:53 alice openais[3475]: [TOTEM] entering GATHER state from 2.
Jun 18 19:04:57 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:04:57 alice kernel: eth2: 10/100 speed: disabling TSO

Then something goes wrong with DRBD:

Jun 18 19:04:58 alice kernel: drbd0: PingAck did not arrive in time.
Jun 18 19:04:58 alice kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jun 18 19:04:58 alice kernel: drbd0: asender terminated
Jun 18 19:04:58 alice kernel: drbd0: Terminating asender thread
Jun 18 19:04:58 alice kernel: drbd0: short read expecting header on sock: r=-512
Jun 18 19:04:58 alice kernel: drbd0: Creating new current UUID
Jun 18 19:04:58 alice kernel: drbd0: Connection closed
Jun 18 19:04:58 alice kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jun 18 19:04:58 alice kernel: drbd0: receiver terminated
Jun 18 19:04:58 alice kernel: drbd0: Restarting receiver thread
Jun 18 19:04:58 alice kernel: drbd0: receiver (re)started
Jun 18 19:04:58 alice kernel: drbd0: conn( Unconnected -> WFConnection )

And something goes wrong in the cluster:

Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering GATHER state from 0.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Creating commit token because I am the rep.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Saving state aru 3c high seq received 3c
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Storing new sequence id for ring 38
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering COMMIT state.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering RECOVERY state.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] position [0] member 10.17.44.116:
Jun 18 19:04:58 alice openais[3475]: [TOTEM] previous ring seq 52 rep 10.17.44.116
Jun 18 19:04:58 alice openais[3475]: [TOTEM] aru 3c high delivered 3c received flag 1
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Did not need to originate any messages in recovery.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] Sending initial ORF token
Jun 18 19:04:58 alice openais[3475]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 18 19:04:58 alice openais[3475]: [CLM  ] New Configuration:
Jun 18 19:04:58 alice kernel: dlm: closing connection to node 2
Jun 18 19:04:58 alice fenced[3494]: bob not a cluster member after 0 sec post_fail_delay
Jun 18 19:04:58 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.116)

Here node "bob" is fenced (it only just joined!):

Jun 18 19:04:58 alice fenced[3494]: fencing node "bob"
Jun 18 19:04:58 alice openais[3475]: [CLM  ] Members Left:
Jun 18 19:04:58 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.117)
Jun 18 19:04:58 alice openais[3475]: [CLM  ] Members Joined:
Jun 18 19:04:58 alice openais[3475]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 18 19:04:58 alice openais[3475]: [CLM  ] New Configuration:
Jun 18 19:04:58 alice openais[3475]: [CLM  ]     r(0) ip(10.17.44.116)
Jun 18 19:04:58 alice openais[3475]: [CLM  ] Members Left:
Jun 18 19:04:58 alice openais[3475]: [CLM  ] Members Joined:
Jun 18 19:04:58 alice openais[3475]: [SYNC ] This node is within the primary component and will provide service.
Jun 18 19:04:58 alice openais[3475]: [TOTEM] entering OPERATIONAL state.
Jun 18 19:04:58 alice openais[3475]: [CLM  ] got nodejoin message 10.17.44.116
Jun 18 19:04:58 alice openais[3475]: [CPG  ] got joinlist message from node 1
Jun 18 19:05:03 alice kernel: eth2: Link is Down
Jun 18 19:05:08 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:05:08 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:05:12 alice kernel: eth2: Link is Down
Jun 18 19:05:13 alice fenced[3494]: fence "bob" success
Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Trying to acquire journal lock...
Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Looking at journal...
Jun 18 19:05:13 alice kernel: GFS: fsid=webclima:web.0: jid=1: Done

After that, eth2 keeps flapping up and down:

Jun 18 19:05:15 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:05:15 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:05:21 alice kernel: eth2: Link is Down
Jun 18 19:05:24 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:05:24 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:05:29 alice kernel: eth2: Link is Down
Jun 18 19:05:33 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:05:33 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:07:26 alice kernel: eth2: Link is Down
Jun 18 19:07:29 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:07:29 alice kernel: eth2: 10/100 speed: disabling TSO
Jun 18 19:07:36 alice kernel: eth2: Link is Down
Jun 18 19:07:38 alice kernel: eth2: Link is Up 100 Mbps Full Duplex, Flow Control: None
Jun 18 19:07:38 alice kernel: eth2: 10/100 speed: disabling TSO

Note that if I don't mount GFS, the node is not fenced and the failover domains become active. So I suspect the problem is in GFS, and not, for example, in the NIC.
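One workaround I am considering (just a sketch, not yet tested on this cluster) is raising the totem token timeout in cluster.conf, so that a short link flap does not immediately cost a node its membership and trigger fencing. The 21000 value below is only an example:

<!-- Hypothetical cluster.conf fragment (placed as a child of <cluster>).
     The token value is in milliseconds; the default is much lower, so
     this makes totem tolerate longer token loss before declaring a
     node dead. -->
<totem token="21000"/>

Of course this would only paper over the flapping link, so I would still like to understand why eth2 goes down in the first place.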
Here is my configuration:

# cat /etc/drbd.conf
global {
    usage-count no;
}
resource r1 {
    protocol C;
    syncer {
        rate 10M;
        verify-alg sha1;
    }
    startup {
        become-primary-on both;
        wfc-timeout 150;
    }
    disk {
        on-io-error detach;
    }
    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "123456";
        after-sb-0pri discard-least-changes;
        after-sb-1pri violently-as0p;
        after-sb-2pri violently-as0p;
        rr-conflict violently;
        ping-timeout 50;
    }
    on alice {
        device    /dev/drbd0;
        disk      /dev/sda2;
        address   10.17.44.116:7789;
        meta-disk internal;
    }
    on bob {
        device    /dev/drbd0;
        disk      /dev/sda2;
        address   10.17.44.117:7789;
        meta-disk internal;
    }
}

# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster alias="web" config_version="20" name="web">
    <fence_daemon post_fail_delay="0" post_join_delay="6"/>
    <clusternodes>
        <clusternode name="alice" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device lanplus="" name="alice-ipmi"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="bob" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device lanplus="" name="bob-ipmi"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_ipmilan" auth="password" ipaddr="10.17.44.134" login="cnmca" name="alice-ipmi" passwd="xxxxxx"/>
        <fencedevice agent="fence_ipmilan" auth="password" ipaddr="10.17.44.135" login="cnmca" name="bob-ipmi" passwd="xxxxxx"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="alice-domain" ordered="1" restricted="1">
                <failoverdomainnode name="alice" priority="1"/>
                <failoverdomainnode name="bob" priority="2"/>
            </failoverdomain>
            <failoverdomain name="bob-domain" ordered="1" restricted="1">
                <failoverdomainnode name="bob" priority="1"/>
                <failoverdomainnode name="alice" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="10.17.44.16" monitor_link="1"/>
            <ip address="10.17.44.17" monitor_link="1"/>
        </resources>
        <service autostart="1" domain="alice-domain" name="alice-alias" recovery="relocate">
            <ip ref="10.17.44.16"/>
        </service>
        <service autostart="1" domain="bob-domain" name="bob-alias" recovery="relocate">
            <ip ref="10.17.44.17"/>
        </service>
    </rm>
</cluster>

# cat /etc/hosts
127.0.0.1       localhost.localdomain localhost
172.17.44.116   alice
172.17.44.117   bob

# ifconfig
bond0     Link encap:Ethernet  HWaddr 00:15:17:51:70:38
          inet addr:10.17.44.116  Bcast:10.17.44.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe51:7038/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
          collisions:11221 txqueuelen:0
          RX bytes:16151284 (15.4 MiB)  TX bytes:102618030 (97.8 MiB)

eth0      Link encap:Ethernet  HWaddr 00:15:17:51:70:38
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:49984 errors:0 dropped:0 overruns:0 frame:0
          TX packets:83669 errors:0 dropped:0 overruns:0 carrier:0
          collisions:11221 txqueuelen:100
          RX bytes:16151284 (15.4 MiB)  TX bytes:102618030 (97.8 MiB)
          Memory:f9140000-f9160000

eth1      Link encap:Ethernet  HWaddr 00:15:17:51:70:38
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:f91a0000-f91c0000

eth2      Link encap:Ethernet  HWaddr 00:19:99:29:08:8B
          inet addr:172.17.44.116  Bcast:172.17.44.255  Mask:255.255.255.0
          inet6 addr: fe80::219:99ff:fe29:88b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:20 errors:0 dropped:0 overruns:0 frame:0
          TX packets:45 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:1200 (1.1 KiB)  TX bytes:7902 (7.7 KiB)
          Memory:f9200000-f9220000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3541 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3541 errors:0 dropped:0 overruns:0
          carrier:0 collisions:0 txqueuelen:0
          RX bytes:464552 (453.6 KiB)  TX bytes:464552 (453.6 KiB)

I hope someone out there has run into this issue before. Thanks in advance.

--
Giuseppe

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster