Hi,

We have two RHEL 5.1 boxes installed on IBM X3850 machines, sharing a single DS4700 SAN, with IBM 2005-B16 fence devices. The system is configured as a high-availability cluster for database systems. We are facing serious non-deterministic problems: they can happen anywhere, at any time, without a single clue. One of the most frequent problems is fence_tool related.

    # service cman start
    Starting cluster:
       Loading modules... done
       Mounting configfs... done
       Starting ccsd... done
       Starting cman... done
       Starting daemons... done
       Starting fencing... fence_tool: can't communicate with fenced -1

    # fenced -D
    1204556546 cman_init error 0 111

    # clustat
    CMAN is not running.

    # cman_tool join
    # clustat
    msg_open: Connection refused
    Member Status: Quorate

      Member Name      ID   Status
      ------ ----      ---- ------
      mobilizc1        1    Online, Local
      mobilizc2        2    Offline

    # groupd -D
    1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
    1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
    1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
    1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm

Sometimes this problem goes away if the two machines are rebooted at the same time. But in the current HA configuration, I cannot guarantee that both systems will be rebooted at the same time for every problem we face. At least one of them should be able to start without a problem.

Moreover, we were facing problems with rgmanager. Below are the related /var/log/messages lines:

    kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
    clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...

We contacted our RH support and they asked us for a clurgmgrd backtrace. Unfortunately, we couldn't get the cman service to start, so we couldn't start clurgmgrd either. (Why couldn't we start cman? I really don't know. It is the same "fence_tool: can't communicate with fenced -1" problem.
As I said previously, it sometimes works and sometimes doesn't.)

Later, they sent us a not-yet-released rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow, we managed to start cman on both machines and then started the rgmanager service with this new rgmanager RPM. We couldn't reproduce the clurgmgrd segfault with it, so this solved the clurgmgrd segfault problem. But we are still getting "can't communicate with fenced -1" errors occasionally.

Sorry for the long post, but I wanted to give as much detail as possible to anyone who tries to help figure out the problem. I have also attached my cluster.conf file to this post. Any kind of help will be really, really appreciated! Thank you very much for your kind interest in reading this far.

Regards.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster