ls -lr /var/lib/openais

If there are core files, openais has crashed for some reason. If this is
the issue, contact me off list.

Regards,
-steve

On Fri, 2008-03-14 at 13:16 +0200, Volkan YAZICI wrote:
> Oops! Here is the cluster.conf file.
>
> On Fri, 14 Mar 2008, Volkan YAZICI <yazicivo@xxxxxxxxxx> writes:
> > We have two RHEL5.1 boxes installed on IBM X3850 machines sharing a
> > single DS4700 SAN with IBM 2005-B16 fence devices. The system is
> > configured for high availability of database services. We are facing
> > serious non-deterministic problems (they can happen anywhere, at any
> > time, without a single clue).
> >
> > One of the most frequently recurring problems is fence_tool related:
> >
> > # service cman start
> > Starting cluster:
> >    Loading modules... done
> >    Mounting configfs... done
> >    Starting ccsd... done
> >    Starting cman... done
> >    Starting daemons... done
> >    Starting fencing... fence_tool: can't communicate with fenced -1
> >
> > # fenced -D
> > 1204556546 cman_init error 0 111
> >
> > # clustat
> > CMAN is not running.
> >
> > # cman_tool join
> >
> > # clustat
> > msg_open: Connection refused
> > Member Status: Quorate
> >
> >  Member Name      ID   Status
> >  ------ ----      ---- ------
> >  mobilizc1        1    Online, Local
> >  mobilizc2        2    Offline
> >
> > # groupd -D
> > 1204556993 cman: our nodeid 1 name mobilizc1 quorum 1
> > 1204556993 found uncontrolled kernel object rgmanager in /sys/kernel/dlm
> > 1204556993 found uncontrolled kernel object clvmd in /sys/kernel/dlm
> > 1204556993 local node must be reset to clear 2 uncontrolled instances of gfs and/or dlm
> >
> > Sometimes this problem is resolved if the two machines are rebooted at
> > the same time. But in the current HA configuration, I cannot guarantee
> > that both systems will be rebooted at the same time for every problem
> > we face. At least one of them should start without a problem.
> >
> > Moreover, we were facing problems with rgmanager.
> > Below are the related /var/log/messages lines:
> >
> > kernel: clurgmgrd[4801]: segfault at 0000000000000000 rip 0000000000408905 rsp 00007fff9075f0b0 error 4
> > clurgmgrd[4800]: <crit> Watchdog: Daemon died, rebooting...
> >
> > We contacted our RH support and they asked for a clurgmgrd backtrace
> > from us. But unfortunately, we couldn't manage to start the cman
> > service in order to start clurgmgrd. (Why couldn't we start cman? We
> > really don't know; it was the same "fence_tool: can't communicate
> > with fenced -1" problem. As I said previously, it sometimes works and
> > sometimes doesn't.) Later, they sent us the not-yet-released
> > rgmanager-2.0.36-1.el5.x86_64.rpm to try. Somehow, we managed to
> > start cman on both machines and then started the rgmanager service
> > with this new rgmanager RPM, and this solved the clurgmgrd segfault
> > problem. (We couldn't reproduce the clurgmgrd segfault afterwards.)
> > But we are still getting "can't communicate with fenced -1" errors
> > occasionally.
> >
> > Sorry for the long post, but I am trying to help the people who will
> > try to help me figure out the problem. I also attach my cluster.conf
> > file with this post. Any kind of help will be really, really
> > appreciated! Thanks so much for your kind interest in reading this
> > far.
> >
> > Regards.
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
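[Editor's sketch] To act on the core-file check suggested at the top of the
thread (and to produce the kind of backtrace RH support asked for), a minimal
shell helper could look like the following. The /var/lib/openais directory
comes from the thread; the /usr/sbin/aisexec binary path in the comment is an
assumption based on RHEL5 defaults, so adjust both for your install:

```shell
#!/bin/sh
# check_cores: report core files under a directory. If any exist, the
# daemon that ran there has likely crashed.
check_cores() {
    dir=$1
    # Expand the core* glob into the positional parameters; if nothing
    # matches, $1 stays as the literal pattern and the -e test fails.
    set -- "$dir"/core*
    if [ -e "$1" ]; then
        echo "crash detected in $dir: $*"
        # To capture a backtrace for support, run gdb in batch mode on
        # each core against the binary that produced it, e.g.:
        #   gdb --batch -ex bt /usr/sbin/aisexec "$dir"/core.12345
    else
        echo "no core files in $dir"
    fi
}

check_cores /var/lib/openais
```

This only reports; it deliberately does not delete or move core files, since
support will want the originals alongside the matching debuginfo packages.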