On Fri, 2006-03-24 at 11:06 +0100, Matteo Catanese wrote:
> Hi Lon,
> your mail is "music" for my ears :D
>
> I will try your /sbin/fence_dontcare immediately.

Best wishes!  If it breaks, all of the pieces are yours to keep.

> I don't want to be interrupted on weekends when I play my favourite
> video game (WoW) just because ONE component broke and the whole
> cluster hung :-)

Great game.

> Sure, our hardware configuration can sustain some multi-point
> failures, but NSPOF is our main goal.

Remember that a redundant remote power switch doesn't obviate the need
for iLO.  iLO is *much* more than a power button.  It has remote
console abilities and other management features -- all of which are
very useful for system administration and maintenance.  In my opinion,
the power-button feature of iLO is the *least* useful part.

> In my case WTI should be useful only in case of multiple failures,
> for example if both network switches fail, so heartbeat fails and
> iLO fails too, and with /sbin/fence_dontcare I will have corruption.
> Is this correct?

With the dontcare hack, you can have corruption if the node stops
heartbeating (for any reason) and iLO does not respond at the time
fence_ilo is called.

Examples: a live-hang of the node with its iLO disconnected, too much
system load to get heartbeats out, network congestion/saturation, bad
cables, routing problems, an internal problem in the switch, ARP
storms, power surges, iLO bugs/failure, too many people logged in to
iLO, etc.

I do not know all of the possible failure cases.  That is why the last
cluster I set up has a remote power controller, even though all of the
nodes individually have iLO as well.

Call me paranoid if you want, but please think about these two points:

(1) Uptime with corrupt data does not equal availability

... and, more importantly ...

(2) It *really* sucks to have to restore from backup when you could be
    playing WoW...

> I will need a supplemental NIC for every server to connect to WTI,

Actually, it should be on the same network as the cluster uses for
communications, especially in two-node CMAN/DLM clusters; check out:

http://people.redhat.com/teigland.sca.pdf

-- Lon

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
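
For reference, a "dontcare" agent of the sort discussed above can be
as small as the sketch below.  This is only a guess at the idea, not
the actual /sbin/fence_dontcare from the thread; it assumes the usual
fenced convention of handing the agent its arguments as key=value
lines on stdin and treating exit status 0 as a successful fence.

    #!/usr/bin/python
    # Hypothetical sketch of a "dontcare" fence agent -- not the real
    # /sbin/fence_dontcare.  A normal agent (fence_ilo, fence_wti, ...)
    # would parse the key=value arguments fenced writes to stdin, power
    # off or reboot the victim node, and exit non-zero on failure.
    # This one ignores the arguments and always claims success, which
    # is exactly why it can lead to corruption if the "fenced" node is
    # in fact still alive.

    import sys

    def main():
        # Read and discard whatever fenced passes in (node name,
        # action, credentials, etc.).
        for line in sys.stdin:
            pass
        # Pretend the fence operation worked, no matter what.
        sys.exit(0)

    if __name__ == "__main__":
        main()

In other words, the hack keeps the cluster from hanging when iLO is
unreachable, but only by lying to fenced about the node having been
cut off -- hence the corruption scenarios described above.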