Cosimo, fencing takes place any time a condition exists where "the cluster" cannot communicate with a node, or cannot guarantee the state of a particular node. Others can surely do the question more justice, but in a nutshell that's it. To test this, simply pull the network cable from a node. The others will not be able to status it and it will get fenced. The same thing happens if you call 'fence_node node01', or whatever your nodes are named (rough example below). The machine will actually be booted _twice_: once from the command, then again when the cluster decides it's no longer talking. I think the fence command should at least have an option to inform the cluster that a node was fenced, but it's not a big deal.

If oom is killing your cluster nodes, I think you're out of luck. GFS can gobble memory, from my experience; more is better. Also, in GFS 6.0.x there is a bug that causes system RAM to be exhausted by GFS locks. The newest release has a tunable parameter, "inoded_purge", which lets you set a percentage of locks to try to purge periodically (sketch below). This helped me a LOT; I was having nodes hang because they could not fork.

BTW, if the GFS folks are reading this, I'd like to make a suggestion. I have not gone code diving yet, but it seems that if the mechanism for a node to respond actually spawned a thread, or did something else that required the system to be able to fork, then systems that are starved of memory would indeed get fenced, since the "OK" response would never get back to the cluster. I realize that doesn't FIX anything per se, but it would prevent the system from hanging for any length of time.

On the start/stop of SAN resources, what exactly do you mean? It sounds like you are talking about what happens when the qlogic drivers load and unload. If that's the case, you need to properly set up zoning on your fibre switch. The load/unload of the qlogic drivers causes a SCSI reset to be sent along the bus, which in the case of fibre channel is every device in the fabric. You need to set up an individual zone for your storage ports, then zones which include each host's ports and the storage together. So on a five-node cluster you'd end up with six zones -- one storage-only zone and five host/storage combos -- then make them all part of the active config (zoning sketch below). That way any SCSI resets are not seen by the other nodes' HBAs. I had problems that were causing nodes to go down due to lost connections to the storage from the SCSI resets. Not good...

Heartbeat should not need any tweaking if everything else is working. Not to say you can't tune it to your situation, just that it should be fine with default settings while you get things stable.
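For what it's worth, this is roughly how I test fencing by hand. The node name is made up, and I'm assuming the RHEL4/CMAN tools here (clustat comes from rgmanager); adjust for GFS 6.0 if that's what you're on:

    # From a surviving node, fence node01 manually. Expect it to be
    # power-cycled once by this command, and possibly again when the
    # cluster notices it has stopped talking (the double boot above).
    fence_node node01

    # Watch membership settle from another node:
    cman_tool nodes
    clustat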
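And this is the kind of thing I mean by the inoded_purge tunable. It's a sketch only: the mount point and the percentage are examples, and I'm assuming your GFS release exposes it through gfs_tool settune -- run gettune first to see what your version actually offers:

    # Ask inoded to try to purge ~50% of unused locks on each pass.
    gfs_tool settune /gfs inoded_purge 50

    # Confirm the current value (and list the other tunables):
    gfs_tool gettune /gfs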
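The zoning I described looks roughly like this on a Brocade-style switch (Cisco/McDATA CLIs differ, and every alias, zone name and WWPN below is made up); shown for two hosts to keep it short:

    # One alias per port.
    alicreate "stor_p0",   "50:06:01:60:10:20:30:40"
    alicreate "node01_h0", "21:00:00:e0:8b:00:00:01"
    alicreate "node02_h0", "21:00:00:e0:8b:00:00:02"

    # One storage-only zone, plus one zone per host pairing its HBA
    # with the storage port, so SCSI resets stay private to that pair.
    zonecreate "z_storage", "stor_p0"
    zonecreate "z_node01",  "node01_h0; stor_p0"
    zonecreate "z_node02",  "node02_h0; stor_p0"

    # Make them all part of the active config.
    cfgcreate "cluster_cfg", "z_storage; z_node01; z_node02"
    cfgsave
    cfgenable "cluster_cfg"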
Hope this helps.

Corey

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Cosimo Streppone
Sent: Monday, May 08, 2006 3:31 PM
To: linux clustering
Subject: Re: Recommended HP servers for cluster suite

Kovacs, Corey J. wrote:
> iLO fencing works just fine.
> [...]
> If you are using RHEL4 + GFS 6.1, then it is simpler since the
> config is expected to be in the same file etc.
>
> [...]

I seem to have got past the SSL modules installation, so that is not the problem.

Thanks for sharing your experience, but I admit I still haven't understood when fencing takes place. What are the conditions that trigger fencing?

> Any specific problem you are having?

Yes. The main problem is that I'm only now beginning to find my way through RHCS4. :-)

Other random problems that I had:

- the oom-killer kernel thread killed my ccs daemon, causing the entire two-node cluster to suddenly become unmanageable;
- start/stop of shared filesystem resources (SAN) causes errors and is therefore not managed properly;
- I don't know how to properly configure heartbeat.

I know these are not iLO problems. In fact, I'm trying to solve one problem at a time, and I don't know whether iLO fencing is the cause of these problems.

I need to do some more research. I'll be back with more useful info.

--
Cosimo

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster