On Mon, Jun 29, 2009 at 10:10:00PM +0200, Fabio M. Di Nitto wrote:
> > 1246297857 fenced 3.0.0.rc3 started
> > 1246297857 our_nodeid 1 our_name node2.foo.bar
> > 1246297857 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
> > 1246297857 found uncontrolled entry /sys/kernel/dlm/rgmanager

And it also leads to:

dlm_controld[14981]: fenced_domain_info error -1

so it's not possible to get the node back without rebooting.

> It looks to me the node has not been shutdown properly and an attempt to
> restart it did fail. The fenced segfault shouldn't happen but I am
> CC'ing David. Maybe he has a better idea.
>
> >
> > when trying to restart fenced. Since this is not possible one has to
> > reboot the node.
> >
> > We're also seeing:
> >
> > Jun 29 19:29:03 node2 kernel: [ 50.149855] dlm: no local IP address has been set
> > Jun 29 19:29:03 node2 kernel: [ 50.150035] dlm: cannot start dlm lowcomms -107
>
> hmm this looks like a bad configuration to me or bad startup.
>
> IIRC dlm kernel is configured via configfs and probably it was not
> mounted by the init script.

It is.

> > from time to time. Stopping/starting via cman's init script (as from the
> > Ubuntu package) several times makes this go away.
> >
> > Any ideas what causes this?
>
> Could you please try to use our upstream init scripts? They work just
> fine (unchanged) in ubuntu/debian environment and they are for sure a
> lot more robust than the ones I originally wrote for Ubuntu many years
> ago.

Tested that without any notable change.

> Could you also please summarize your setup and config? I assume you did
> the normal checks such as cman_tool status, cman_tool nodes and so on...
>
> The usual extra things I'd check are:
>
> - make sure the hostname doesn't resolve to localhost but to the real ip
> address of the cluster interface
> - cman_tool status
> - cman_tool nodes

These all do look o.k.
However:

> - Before starting any kind of service, such as rgmanager or gfs*, make
> sure that the fencing configuration is correct. Test by using fence_node
> $nodename.

fence_node node1 segfaults at the same location as described above, which
seems to be the cause of the trouble. (However, "fence_ilo -z -l user -p pass
-a iloip" works as expected.)

The segfault happens in make_args in fence/libfence/agent.c, where the second
XPath lookup (FENCE_DEVICE_ARGS_PATH) returns a bogus (non-NULL) str. Doing
this XPath lookup by hand looks fine, so it seems ccs_get_list is returning
corrupted pointers.

I've attached the current cluster.conf.

Cheers,
 -- Guido
<?xml version="1.0"?>
<cluster config_version="5" name="cl">
  <cman two_node="1" expected_votes="2">
  </cman>
  <dlm log_debug="1"/>
  <clusternodes>
    <clusternode name="node1.foo.bar" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="fence1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.foo.bar" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="fence2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="rnode1.foo.bar" login="reboot" name="node1" passwd="pass"/>
    <fencedevice agent="fence_ilo" hostname="rnode2.foo.bar" login="reboot" name="node2" passwd="pass"/>
  </fencedevices>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="kvm-hosts" ordered="1">
        <failoverdomainnode name="node1.foo.bar"/>
        <failoverdomainnode name="node2.foo.bar"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <virt name="test11" />
      <virt name="test12" />
    </resources>
    <service name="test11">
      <virt ref="test11"/>
    </service>
    <service name="test12">
      <virt ref="test12"/>
    </service>
  </rm>
</cluster>
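For anyone who wants to repeat the by-hand lookup outside of libfence, here is a rough sketch in Python. It is not what fence_node or ccs_get_list does internally; it only assumes that each <device name="..."/> under a clusternode is meant to be resolved against a <fencedevice name="..."/> entry under <fencedevices>. The function name unresolved_fence_devices and the sample config in the script are made up for illustration; point it at your own cluster.conf text instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample config for illustration only (not the attached file).
SAMPLE_CONF = """<?xml version="1.0"?>
<cluster config_version="1" name="example">
  <clusternodes>
    <clusternode name="a.example" nodeid="1">
      <fence><method name="1"><device name="dev-a"/></method></fence>
    </clusternode>
    <clusternode name="b.example" nodeid="2">
      <fence><method name="1"><device name="dev-missing"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ilo" name="dev-a" hostname="ilo-a.example"/>
  </fencedevices>
</cluster>
"""


def unresolved_fence_devices(conf_text):
    """Return (node name, device name) pairs with no matching <fencedevice>."""
    root = ET.fromstring(conf_text)  # root element is <cluster>

    # Names of all fence devices defined under <fencedevices>.
    defined = {fd.get("name")
               for fd in root.findall("./fencedevices/fencedevice")}

    missing = []
    for node in root.findall("./clusternodes/clusternode"):
        # Every device referenced by any fence method of this node.
        for dev in node.findall("./fence/method/device"):
            if dev.get("name") not in defined:
                missing.append((node.get("name"), dev.get("name")))
    return missing


if __name__ == "__main__":
    for node_name, dev_name in unresolved_fence_devices(SAMPLE_CONF):
        print(f"{node_name}: no <fencedevice name={dev_name!r}> defined")
```

An empty result only means every referenced device name has a definition; it says nothing about whether the agent arguments themselves are correct.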
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster