On Thu, Jun 28, 2007 at 07:54:05PM +0300, Janne Peltonen wrote:
> On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote:
> > I can't really help you there. In EL4, each of the services is
> > separate, so a node can be part of the cluster but doesn't need to
> > share resources such as a shared SAN disk. If you have a service set
> > up so that it requires that resource, then the node should be fenced.

RHEL5 is the same FWIW, or should be.

> * When a clurgmgrd starts, it wants to know the status of all the
> services, and to make sure, it stops all services locally (unmounts
> the filesystems, runs the scripts with "stop") - and asks the
> already-running cluster members for their idea of the status.

Right.

> * When the clurgmgrd on the fifth node starts, it tries to stop the
> SAN-requiring services locally - and cannot match the /dev/<vg>/<lv>
> paths with real devices, so it ends up with incoherent information
> about their status.

This should not cause a problem.
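
You can double-check that the devices really are invisible on the fifth
node; something like this would show it (the concrete VG/LV name below
is made up - substitute the one from your cluster.conf):

    lvscan                    # the SAN-backed LVs should not be listed here
    ls -l /dev/<vg>/<lv>      # e.g. /dev/vg_san/lv_im - should not exist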

> * If all the nodes with SAN access are restarted (while the fifth node
> is up), the nodes with SAN access first stop the services locally -
> and then, apparently, ask the fifth node about the service status.
> The result is a line like the following, for each service:
>
> --cut--
> Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im
> --cut--

What do you mean here? (Sorry, being daft.) Does "restart all nodes"
mean "restart just rgmanager on all nodes", or "reboot all nodes"?
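
To be concrete, the two cases I mean would be (assuming the stock init
scripts):

    service rgmanager restart    # restart only the resource manager
    reboot                       # full node reboot, with fencing/quorum
                                 # transitions as the node leaves and rejoins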

> (What is weird, though, is that the fifth node knows perfectly well
> the status of this particular service, since it's running the service
> (service:im doesn't need SAN access) - perhaps there is some other
> reason not to believe the fifth node at this point. I can't imagine
> what it'd be, though.)

The output of 'cman_tool services' from each node could help here.
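
A quick way to collect that from all five nodes (the host names here
are made up - use your own):

    for n in node1 node2 node3 node4 node5; do
        echo "== $n =="
        ssh root@$n cman_tool services
    done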

> * After that, the nodes with SAN access do nothing about any services
> until the fifth node has left the cluster and has been fenced.

If you're rebooting the other 4 nodes, it sounds like the 5th is
holding some sort of lock which it shouldn't be holding across quorum
transitions (which would be a bug). If this is the case, could you:

* install rgmanager-debuginfo
* get me a backtrace:

    gdb clurgmgrd `pidof clurgmgrd`
    thr a a bt
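
Spelled out, the whole session would look something like this (the
logging commands are optional, just to capture the output to a file):

    gdb clurgmgrd `pidof clurgmgrd`
    (gdb) set logging file clurgmgrd-bt.txt
    (gdb) set logging on
    (gdb) thread apply all backtrace     # short form: thr a a bt
    (gdb) quit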

-- Lon

--
Lon Hohberger - Software Engineer - Red Hat, Inc.

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster