Lon Hohberger wrote:
> On Tue, 2007-11-20 at 14:06 -0800, Scott Becker wrote:
>> So assuming only one real failure at a time, I'm thinking of making
>> the first step in the fencing method a check for pinging the gateway.
>> That way, when a node wants to fence, it will only be able to if its
>> public NIC is working, even though it's using the private NIC for the
>> rest of the fencing.
>
> That's a pretty good + simple idea.
>
> -- Lon

I got fencing set up and performed a test. My "fencing heuristic" worked, but I ran into another problem: for some unrelated reason, the good node did not attempt to fence the bad node, and at the same time it did not take over the service.

To simulate a NIC failure, I unplugged the cables from the public NIC on node 3 (node 2 is the other node). Node 3 was running the IP address service. Below is all the data I know how to fetch; what else can I provide? The very last line (from the disconnected node) is my fence script hack properly stopping fencing, since the public NIC is out. Further down the log (not shown here), fencing is attempted repeatedly; it was my understanding that it would only try each method once. Help!
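For context, the gateway-ping guard I prepended to the fence method looks roughly like this. This is a minimal sketch, not the exact script: the gateway address here is a placeholder, and the real agent invocation is only indicated in a comment.

```shell
#!/bin/sh
# Pre-fence guard: refuse to fence the peer unless our own public NIC
# can still reach the gateway. If we can't ping out, the failure is
# probably ours, so abort instead of shooting the healthy node.
#
# GATEWAY is a placeholder (203.0.113.1 is a documentation address);
# set it to the cluster's real default gateway.
GATEWAY="${GATEWAY:-203.0.113.1}"

can_reach_gateway() {
    # one ping with a 2-second deadline; success means the public
    # link is alive
    ping -c 1 -w 2 "$1" >/dev/null 2>&1
}

pre_fence_check() {
    if can_reach_gateway "$GATEWAY"; then
        return 0   # safe to proceed; the real agent (fence_apc) runs next
    else
        echo "Can not ping gateway" >&2
        return 1   # non-zero makes fenced treat the attempt as failed
    fi
}
```

The "Can not ping gateway" message this emits on failure is what shows up in the node 3 log below, reported by fenced as the agent's output.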
clustat from node 2 before failure test:

  Member Status: Quorate

  Member Name                ID   Status
  ------ ----                ---- ------
  205.234.65.132             2    Online, Local, rgmanager
  205.234.65.133             3    Online, rgmanager

  Service Name               Owner (Last)      State
  ------- ----               ----- ------      -----
  service:Web Server A       205.234.65.133    started

clustat from node 2 after pulling node 3:

  Member Status: Quorate

  Member Name                ID   Status
  ------ ----                ---- ------
  205.234.65.132             2    Online, Local, rgmanager
  205.234.65.133             3    Offline

  Service Name               Owner (Last)      State
  ------- ----               ----- ------      -----
  service:Web Server A       205.234.65.133    started

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster alias="bxwa" config_version="8" name="bxwa">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="205.234.65.132" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="RackPDU1" option="off" port="2"/>
          <device name="RackPDU2" option="off" port="2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="205.234.65.133" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="RackPDU1" option="off" port="3"/>
          <device name="RackPDU2" option="off" port="3"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.7.11" login="root" name="RackPDU1" passwd_script="/root/cluster/rack_pdu"/>
    <fencedevice agent="fence_apc" ipaddr="192.168.7.12" login="root" name="RackPDU2" passwd_script="/root/cluster/rack_pdu"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
    <service autostart="1" exclusive="0" name="Web Server Address" recovery="relocate">
      <ip address="205.234.65.138" monitor_link="1"/>
    </service>
  </rm>
</cluster>

Node 2, /var/log/messages:

openais[9498]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[9498]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[9498]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[9498]: [TOTEM] entering GATHER state from 2.
openais[9498]: [TOTEM] entering GATHER state from 0.
openais[9498]: [TOTEM] Creating commit token because I am the rep.
openais[9498]: [TOTEM] Saving state aru 47 high seq received 47
openais[9498]: [TOTEM] Storing new sequence id for ring 6c
openais[9498]: [TOTEM] entering COMMIT state.
openais[9498]: [TOTEM] entering RECOVERY state.
openais[9498]: [TOTEM] position [0] member 205.234.65.132:
openais[9498]: [TOTEM] previous ring seq 104 rep 205.234.65.132
openais[9498]: [TOTEM] aru 47 high delivered 47 received flag 1
openais[9498]: [TOTEM] Did not need to originate any messages in recovery.
openais[9498]: [TOTEM] Sending initial ORF token
openais[9498]: [CLM  ] CLM CONFIGURATION CHANGE
openais[9498]: [CLM  ] New Configuration:
kernel: dlm: closing connection to node 3
fenced[9568]: 205.234.65.133 not a cluster member after 0 sec post_fail_delay
openais[9498]: [CLM  ] r(0) ip(205.234.65.132)
openais[9498]: [CLM  ] Members Left:
openais[9498]: [CLM  ] r(0) ip(205.234.65.133)
openais[9498]: [CLM  ] Members Joined:
openais[9498]: [CLM  ] CLM CONFIGURATION CHANGE
openais[9498]: [CLM  ] New Configuration:
openais[9498]: [CLM  ] r(0) ip(205.234.65.132)
openais[9498]: [CLM  ] Members Left:
openais[9498]: [CLM  ] Members Joined:
openais[9498]: [SYNC ] This node is within the primary component and will provide service.
openais[9498]: [TOTEM] entering OPERATIONAL state.
openais[9498]: [CLM  ] got nodejoin message 205.234.65.132
openais[9498]: [CPG  ] got joinlist message from node 2

Node 3, /var/log/messages:

kernel: bonding: bond0: now running without any active interface !
openais[2921]: [TOTEM] The token was lost in the OPERATIONAL state.
openais[2921]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
openais[2921]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
openais[2921]: [TOTEM] entering GATHER state from 2.
clurgmgrd: [3759]: <warning> Link for bond0: Not detected
clurgmgrd: [3759]: <warning> No link on bond0...
clurgmgrd[3759]: <notice> status on ip "205.234.65.138" returned 1 (generic error)
clurgmgrd[3759]: <notice> Stopping service service:Web Server Address
openais[2921]: [TOTEM] entering GATHER state from 0.
openais[2921]: [TOTEM] Creating commit token because I am the rep.
openais[2921]: [TOTEM] Saving state aru 47 high seq received 47
openais[2921]: [TOTEM] Storing new sequence id for ring 6c
openais[2921]: [TOTEM] entering COMMIT state.
openais[2921]: [TOTEM] entering RECOVERY state.
openais[2921]: [TOTEM] position [0] member 205.234.65.133:
openais[2921]: [TOTEM] previous ring seq 104 rep 205.234.65.132
openais[2921]: [TOTEM] aru 47 high delivered 47 received flag 1
openais[2921]: [TOTEM] Did not need to originate any messages in recovery.
openais[2921]: [TOTEM] Sending initial ORF token
openais[2921]: [CLM  ] CLM CONFIGURATION CHANGE
openais[2921]: [CLM  ] New Configuration:
kernel: dlm: closing connection to node 2
openais[2921]: [CLM  ] r(0) ip(205.234.65.133)
openais[2921]: [CLM  ] Members Left:
fenced[2937]: 205.234.65.132 not a cluster member after 0 sec post_fail_delay
openais[2921]: [CLM  ] r(0) ip(205.234.65.132)
openais[2921]: [CLM  ] Members Joined:
fenced[2937]: fencing node "205.234.65.132"
openais[2921]: [CLM  ] CLM CONFIGURATION CHANGE
openais[2921]: [CLM  ] New Configuration:
openais[2921]: [CLM  ] r(0) ip(205.234.65.133)
openais[2921]: [CLM  ] Members Left:
openais[2921]: [CLM  ] Members Joined:
openais[2921]: [SYNC ] This node is within the primary component and will provide service.
openais[2921]: [TOTEM] entering OPERATIONAL state.
openais[2921]: [CLM  ] got nodejoin message 205.234.65.133
openais[2921]: [CPG  ] got joinlist message from node 3
fenced[2937]: agent "fence_apc" reports: Can not ping gateway
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster