Hi all, this is my first mail to this mailing list.

I'm experimenting with the STABLE2 branch (the cluster-2.03.11 release) on a couple of Gentoo servers (a 2-node cluster) using DRBD in primary/primary. I use RHCS for clvm, fencing, and failover of services (KVM with libvirt, plus a primary/secondary DRBD device used for backups). Every node has 3 gigabit ethernet interfaces: two of them are trunked in a bond device and used for DRBD replication and cluster communication, while the third is the public interface. cluster.conf is attached.

I've gone through all the steps and configured cman, fenced using IPMI LAN, and rgmanager (with vm.sh taken from git to use libvirt), and everything is working as expected. At least, issuing

  clusvcadm -M vm:vm01 -m node2

makes the machine migrate to the other node. Similarly, enabling/disabling/relocating a vm works too.

Obviously there's a problem :) While testing failover I noticed a behaviour similar to what was reported on the ML in April (http://www.mail-archive.com/linux-cluster@xxxxxxxxxx/msg05919.html). Powering off a node via IPMI to simulate a failure, I saw in the log files:

fenced[9592]: node2 not a cluster member after 0 sec post_fail_delay
fenced[9592]: fencing node "node2"
fenced[9592]: can't get node number for node <garbage_here>
fenced[9592]: fence "node2" success

clustat then showed node2 as offline, but its services were still marked as "started" on the fenced node2. When node2 came back, the services did not relocate.

I tried to trace the problem in the code, and found in cluster-2.03.11/fence/fenced/agent.c:

313         if (ccs_lookup_nodename(cd, victim, &victim_nodename) == 0)
314                 victim = victim_nodename;

Then on line 358 victim_nodename is freed:

357         if (victim_nodename)
358                 free(victim_nodename);

and then update_cman() is called with "victim" as the node name; since victim still points to the freed victim_nodename buffer, the nodeid cannot be retrieved (and garbage is printed to syslog):

361         if (!error) {
362                 update_cman(victim, good_device);
363                 break;

I admit I don't understand why ccs_lookup_nodename() returns 0 here, but delaying the free() until after the update_cman() call makes everything work: services relocate to the other node, and when node2 comes back and rejoins the cluster they migrate back to the original node, as expected.

Complete patch:

diff -Nuar a/fence/fenced/agent.c b/fence/fenced/agent.c
--- a/fence/fenced/agent.c	2009-01-22 13:33:51.000000000 +0100
+++ b/fence/fenced/agent.c	2009-07-14 01:19:26.385518781 +0200
@@ -354,14 +354,14 @@
 		if (device)
 			free(device);
-		if (victim_nodename)
-			free(victim_nodename);
 		free(method);
 		if (!error) {
 			update_cman(victim, good_device);
 			break;
 		}
+		if (victim_nodename)
+			free(victim_nodename);
 	}
 	ccs_disconnect(cd);

The question is: should I open a bug on bugzilla? Or is my setup (Gentoo, vm.sh backported, etc.) too unusual for this to be useful? Or is it just a problem in my configuration?

Sorry for my English, but I'm not a native speaker.

Regards,
Giacomo
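P.S. In case it helps anyone review the patch, here is a minimal standalone sketch of the pointer problem as I understand it. It is not the real fenced code: lookup_nodename() below is just a stand-in I wrote for ccs_lookup_nodename(), and the printf stands in for the update_cman() call. It only shows the same pattern of re-pointing "victim" at the looked-up buffer, freeing that buffer, and then using the stale pointer, which is where I believe the garbage in syslog comes from:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for ccs_lookup_nodename(): returns 0 on success and hands
 * back a freshly allocated copy of the resolved node name */
static int lookup_nodename(const char *in, char **out)
{
	*out = strdup(in);
	return *out ? 0 : -1;
}

int main(void)
{
	const char *configured = "node2";
	const char *victim = configured;
	char *victim_nodename = NULL;

	if (lookup_nodename(configured, &victim_nodename) == 0)
		victim = victim_nodename;   /* victim now aliases the allocation */

	free(victim_nodename);              /* freed here, as in the unpatched code ... */

	/* ... but still used afterwards, like update_cman(victim, ...) is;
	 * reading freed memory is undefined behaviour, hence the garbage.
	 * The patch simply moves the free() below this point. */
	printf("fencing node \"%s\"\n", victim);

	return 0;
}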
Attachment: cluster.conf (application/xml)