On Thu, Jun 28, 2007 at 12:29:04PM -0400, Robert Gil wrote:
> I can't really help you there. In EL4 each of the services are separate.
> So a node can be part of the cluster but doesn't need to share the
> resources such as a shared san disk. If you have the resources set up so
> that it requires that resource, then it should be fenced.

Yep. The situation seems to be this (someone who really knows about the
inner workings of the resource group manager, correct me):

*when a clurgmgrd starts, it wants to know the status of all the
 services, and to make sure, it stops all services locally (unmounts
 the filesystems, runs the scripts with "stop") - and asks the
 already-running cluster members their idea of the status

*when the clurgmgrd on the fifth node starts, it tries to stop the
 SAN-requiring services locally - and cannot match the /dev/<vg>/<lv>
 paths with real devices, so it ends up with incoherent information
 about their status

*if all the nodes with SAN access are restarted (while the fifth node
 is up), the nodes with SAN access first stop the services locally -
 and then, apparently, ask the fifth node about the service status.
 Result: a line like the following, for each service:

--cut--
Jun 28 17:56:20 pcn2.mappi.helsinki.fi clurgmgrd[5895]: <err> #34: Cannot get status for service service:im
--cut--

(What is weird, though, is that the fifth node knows perfectly well the
status of this particular service, since it's running the service
(service:im doesn't need SAN access) - perhaps there is some other
reason not to believe the fifth node at this point. Can't imagine what
it'd be, though.)

*after that, the nodes with SAN access do nothing about any services
 until after the fifth node has left the cluster and has been fenced.

So, apparently, the other nodes conclude that the fifth node is 'bad'
and could be interfering with their SAN-access-requiring services. When
the fifth node has been fenced, the other nodes start the services. And
the fifth node can join the cluster and start the services that should
be running there...
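For reference, the "restricting" I mention below is done with a
restricted failover domain in cluster.conf. A minimal sketch of what I
mean - node, service and device names here are made up for
illustration, not my actual config:

--cut--
<rm>
  <failoverdomains>
    <!-- restricted="1": rgmanager may run the services bound to this
         domain only on the nodes listed here (the four blades that
         actually see the SAN) -->
    <failoverdomain name="san-domain" restricted="1" ordered="0">
      <failoverdomainnode name="blade1" priority="1"/>
      <failoverdomainnode name="blade2" priority="1"/>
      <failoverdomainnode name="blade3" priority="1"/>
      <failoverdomainnode name="blade4" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <!-- a SAN-requiring service bound to the restricted domain -->
  <service name="mail" domain="san-domain" autostart="1">
    <fs name="mailfs" device="/dev/vg_san/lv_mail"
        mountpoint="/srv/mail" fstype="ext3"/>
  </service>
</rm>
--cut--

Even with the SAN services bound to a restricted domain like that, the
other nodes still seem to ask the fifth node about their status (and
then refuse to believe it), which is exactly the part I don't
understand.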
> > -----Original Message-----
> > From: linux-cluster-bounces@xxxxxxxxxx
> > [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Janne Peltonen
> > Sent: Thursday, June 28, 2007 11:46 AM
> > To: linux-cluster@xxxxxxxxxx
> > Subject: Cluster node without access to all resources - trouble
> >
> > Hi.
> >
> > I'm running a five-node cluster. Four of the nodes run services that
> > need access to a SAN, but the fifth doesn't. (The fifth node belongs
> > to the cluster to avoid a cluster with an even number of nodes.
> > Additionally, the fifth node is a stand-alone rack server, while the
> > four other nodes are blade servers, two of them in two different
> > blade racks - this way, even if either of the blade racks goes down,
> > I won't lose the cluster.) This seems to create all sorts of
> > trouble. For example, if I try to manipulate clvm'd filesystems on
> > the other four nodes, they refuse to commit changes if the fifth
> > node is up. And even if I've restricted the SAN-access-needing
> > services to run only on the four nodes that have the access, the
> > cluster system tries to shut the services down on the fifth node
> > also (when quorum is lost, for example) - and complains about being
> > unable to stop them and, on the nodes that should run the services,
> > refuses to restart them until I've removed the fifth node from the
> > cluster and fenced it. (Or, rather, I've removed the fifth node from
> > the cluster and one of the other nodes has successfully fenced it.)
> >
> > So.
> >
> > Is it really necessary that all the members in a cluster have access
> > to all the resources that any of the members have, even if the
> > services in the cluster are partitioned to run in only a part of the
> > cluster? Or is there a way to tell the cluster that it shouldn't
> > care about the fifth member's opinion about certain services; that
> > is, it doesn't need to check if the services are running on it,
> > because they never do? Or should I just make sure that the fifth
> > member always comes up last (that is, won't be running while the
> > others are coming up)? Or should I accept that I'm going to create
> > more harm than good by letting the fifth node belong to the cluster,
> > and just run it outside the cluster?
> >
> > Sorry if this was incoherent. I'm a bit tired; this system should be
> > in production in two weeks, and unexpected problems (that didn't
> > come up during testing) keep coming up... Any suggestions would be
> > greatly appreciated.
> >
> > --Janne
> > --
> > Janne Peltonen <janne.peltonen@xxxxxxxxxxx>

--
Janne Peltonen <janne.peltonen@xxxxxxxxxxx>

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster