On Wed, Jun 20, 2007 at 05:57:05PM -0500, Chris Harms wrote:
> My nodes were set to "quorum=1 two_node=1" and fenced by DRAC cards
> using telnet over their NICs.  The same NICs are used in my bonded config on
> the OS, so I assumed it was on the same network path.  Perhaps I assume
> incorrectly.

That sounds mostly right.  The point is that a node disconnected from the
cluster must not be able to fence a node which is supposedly still
connected.  That is: 'A' must not be able to fence 'B' if 'A' becomes
disconnected from the cluster.  However, 'A' must be able to be fenced if
'A' becomes disconnected.

Why was DRAC unreachable; was it unplugged too?  (Is DRAC like IPMI - in
that it shares a NIC with the host machine?)

> Desired effect would be survivor claims service(s) running on
> unreachable node and attempts to fence unreachable node or bring it back
> online without fencing should it establish contact.  Actual result was
> survivor spun its wheels trying to fence unreachable node and did not
> assume services.

Yes, this is an unfortunate limitation of using (most) integrated power
management systems.  Basically, some BMCs share a NIC with the host
(IPMI), and some run off of the machine's power supply (IPMI, iLO, DRAC).
When the fence device becomes unreachable, we don't know whether it's a
total network outage or a "power disconnected" state.

* If the power to a node has been disconnected, it's safe to recover.

* If the node just lost all of its network connectivity, it's *NOT* safe
  to recover.

* In both cases, we cannot confirm the node is dead... which is why we
  don't recover.

> Restoring network connectivity induced the previously
> unreachable node to reboot and the surviving node experienced some kind
> of weird power off and then powered back on (???).

That doesn't sound right; the surviving node should have stayed put (not
rebooted).

> Ergo I figured I must need quorum disk so I can use something like a
> ping node.  My present plan is to use a loop device for the quorum disk
> device and then set up ping heuristics.  Will this even work, i.e. do the
> nodes both need to see the same qdisk or can I fool the service with a
> loop device?

I don't believe the effects of tricking qdiskd in this way have been
explored; I don't see why it wouldn't work in theory, but...

qdiskd, with or without a disk, won't fix the behavior you experienced
(uncertain state due to failure to fence -> retry / wait for node to
come back).

> I am not deploying GFS or GNBD and I have no SAN.  My only
> option would be to add another DRBD partition for this purpose which may
> or may not work.
> What is the proper setup option, two_node=1 or qdisk?

In your case, I'd say two_node="1".

--
Lon Hohberger - Software Engineer - Red Hat, Inc.
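
For illustration, a minimal two-node cluster.conf along the lines above
might look roughly like the sketch below, assuming the standard fence_drac
agent.  The cluster name, node names, DRAC addresses, and credentials are
placeholders, and the <rm> service configuration is omitted:

  <?xml version="1.0"?>
  <cluster name="example" config_version="1">
    <!-- two_node="1" lets a single remaining vote keep quorum;
         expected_votes must also be 1 -->
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="node1.example.com" nodeid="1" votes="1">
        <fence>
          <method name="1">
            <!-- fence this node via its own DRAC card -->
            <device name="drac-node1"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="node2.example.com" nodeid="2" votes="1">
        <fence>
          <method name="1">
            <device name="drac-node2"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <!-- one fence_drac entry per DRAC card; the addresses and
         credentials here are placeholders -->
    <fencedevices>
      <fencedevice agent="fence_drac" name="drac-node1"
                   ipaddr="10.0.0.11" login="root" passwd="secret"/>
      <fencedevice agent="fence_drac" name="drac-node2"
                   ipaddr="10.0.0.12" login="root" passwd="secret"/>
    </fencedevices>
  </cluster>

Keep in mind the caveat above: if the DRAC cards share a network path with
the cluster interconnect, a single network failure can leave the survivor
unable to fence, and it will block rather than recover.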