Hello Jerome,

Hope you're well.

I didn't open a ticket, as things seem to work correctly with qdisk
enabled. There is a chance the guys from support will tell me to
re-enable qdisk... ;-)

I was hoping Lon or Chrissie could point me to some idea.
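To give an idea of what is being toggled between the two runs, here is a
minimal sketch of the relevant cluster.conf bits. It is a fragment for
illustration only: the cluster name, label, address and timings are
placeholders, and the clusternodes/fencedevices sections are omitted.

  <cluster name="testcluster" config_version="42">
    <!-- illustrative values only, not the real configuration -->
    <!-- cman can be forced to broadcast instead of multicast for the test -->
    <cman broadcast="yes" expected_votes="5"/>

    <!-- the quorum disk stanza that gets removed/re-added between runs;
         with 4 node votes plus 1 qdisk vote, expected_votes is 5 and
         quorum is 3 -->
    <quorumd interval="2" tko="10" votes="1" label="testqdisk">
      <heuristic program="ping -c1 -w1 192.168.0.254" score="1" interval="2"/>
    </quorumd>
  </cluster>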
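For anyone wanting to reproduce this, the standard RHEL 5 cluster suite
tools below are enough to watch membership and quorum on each node while
flipping node3's VLAN (just a sketch; adapt to your own setup):

  # run on each node during the test
  clustat                    # member/service view as rgmanager sees it
  cman_tool status           # quorum state, expected votes, total votes
  cman_tool nodes            # per-node membership as cman sees it
  group_tool ls              # fence/dlm group state (handy to spot a stuck fence domain)
  tail -f /var/log/messages  # openais/fenced/clurgmgrd messages, e.g. "Quorum Dissolved"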
Brem

2010/3/3 Jerome Fenal <jfenal@xxxxxxxxxx>:
> On Wednesday 03 March 2010 at 14:23 +0100, brem belguebli wrote:
>> Hi Xavier,
>
> Hi Brem, Xavier,
>
>> 2010/3/3 Xavier Montagutelli <xavier.montagutelli@xxxxxxxxx>:
>> > On Wednesday 03 March 2010 03:11:50 brem belguebli wrote:
>> >> Hi,
>> >>
>> >> I experienced a strange cluster behavior that I couldn't explain.
>> >>
>> >> I have a 4-node RHEL 5.4 cluster (node1, node2, node3 and node4).
>> >>
>> >> Node1 and node2 are connected to an Ethernet switch (sw1); node3 and
>> >> node4 are connected to another switch (sw2). The 4 nodes are on the
>> >> same VLAN.
>> >>
>> >> sw1 and sw2 are connected through a couple of core switches, and the
>> >> nodes' VLAN is propagated across the network I just described.
>> >>
>> >> Latency between node1 and node4 (on 2 different switches) doesn't
>> >> exceed 0.3 ms.
>> >>
>> >> The cluster is normally configured with an iSCSI quorum device
>> >> located on another switch.
>> >>
>> >> I wanted to check how the cluster would behave, with the quorum disk
>> >> not active (removed from cluster.conf), if a member node got isolated
>> >> (link up but not on the right VLAN).
>> >>
>> >> Node3 is the one I played with.
>> >>
>> >> The fence device for this node is intentionally misconfigured so that
>> >> I can follow what happens on its console.
>> >>
>> >> When I change the VLAN membership of node3, the results are as
>> >> expected: the 3 remaining nodes see it go offline after the totem
>> >> timers expire, and node1 (lowest node id) starts trying to fence
>> >> node3 (without success, since fencing is intentionally
>> >> misconfigured).
>> >>
>> >> Node3 sees itself as the only member of the cluster, which is
>> >> inquorate. Coherent, as it became a single-node partition.
>> >>
>> >> When node3's VLAN configuration is put back to the right value,
>> >> things go bad.
>> >
>> > (My two cents)
>> >
>> > You just put it back in the right VLAN, without restarting the host?
>>
>> Yep, this is what I wanted to test.
>>
>> > I did this kind of test (under RH 5.3), and things always go bad if a
>> > node that is supposed to be fenced is not really fenced and comes
>> > back. Perhaps this is an intended behaviour to prevent "split brain"
>> > cases (even at the cost of the whole cluster going down)? Or perhaps
>> > it depends on how your misconfigured fence device behaves (does it
>> > give an exit status? What exit status does it send?).
>
> +1
>
>> When node3 comes back with the same membership state as before, node1
>> (and 2 and 4) kill node3 (instruct its cman to exit) because its
>> previous state is the same as the new one.
>>
>> The problem is that, in the logs, node1 and node2 lose quorum at the
>> very same time (clurgmgrd[10469]: <emerg> #1: Quorum Dissolved) and go
>> offline. This is what I cannot explain.
>>
>> There is no split-brain issue involved here, as I expected node1 (and
>> why not all the other nodes) to instruct node3's cman to exit, and
>> things could continue to run (maybe without relocating node3's
>> services, since it couldn't be fenced).
>>
>> Concerning the fencing, it may return a non-zero value, as I can see in
>> node1's logs that it keeps looping, trying to fence node3.
>>
>> >> Node1, 2 and 4 instruct node3's cman to kill itself, as it reappeared
>> >> with an already existing status. Why not.
>> >>
>> >> Node1 and node2 then say the quorum is dissolved and see themselves
>> >> as offline (????), node3 offline and node4 online.
>> >>
>> >> Node4 sees itself as online but the cluster as inquorate, since it
>> >> also lost node1 and node2.
>> >>
>> >> I thought about potential multicast problems, but it behaves the same
>> >> way when cman is configured to broadcast.
>> >>
>> >> The same test run with qdisk enabled behaves normally: when node3
>> >> gets back on the network it is automatically rebooted (thanks to
>> >> qdisk), and the cluster remains stable.
>>
>> The fact that it works when qdisk is enabled may be a "side effect": I
>> use an iSCSI LUN accessed through the LAN interface, so with qdisk
>> acting as a "heartbeat vector", node3 not being able to write to the
>> LUN may make things more stable.
>>
>> I should give it a try with a SAN LUN used as the qdisk and see how it
>> behaves.
>
> It would help to see the architecture details, configuration and logs.
> Did you open a ticket with our support to investigate this behaviour
> with our experts?
>
> Regards,
>
> J.
> --
> Jérôme Fenal, RHCE              Tel.: +33 1 41 91 23 37
> Solution Architect              Mob.: +33 6 88 06 51 15
> Pre-sales Consultant            Fax.: +33 1 41 91 23 32
> http://www.fr.redhat.com/       jfenal@xxxxxxxxxx
> Red Hat France SARL             Siret n° 421 199 464 00064
> Le Linea, 1 rue du Général Leclerc 92047 Paris La Défense Cedex
> Come to the Red Hat Tech Happy Hours: http://www.redhat.fr/events/happy-hour/

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster