RHEL4 - loss of iscsi connectivity causes rgmanager crash

Charles Riley <criley@xxxxxxxx> · Mon, 20 Apr 2009 15:40:33 -0400

Hi all,

With the "iscsi doubt" thread in mind, I thought I'd share an experience
I've had twice now with iSCSI and the RHEL 4 cluster manager.

What happens is that an iSCSI filesystem that is part of a resource
group becomes unavailable (in dmesg, you see iscsid lose its connection,
then attempt to reconnect over and over). However, rgmanager does not
seem to detect that the filesystem has disappeared, even though the
filesystem is configured in the resource group using the built-in "fs"
resource agent.

When I try to fail the resource group over to another node, rgmanager
gets confused and starts reporting bogus information. During the most
recent failure, rgmanager crashed on all but two of the six nodes. On
the two nodes where it was still running, resource groups showed as
starting, stopping, or running on nodes that I'd manually fenced five
minutes before. I ended up rebooting all of the servers and bringing
them up clean.
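
To make the undetected-outage case above concrete, here is a rough
sketch of an out-of-band probe that treats a mount as failed if it does
not answer within a timeout. This is only an illustration, not how the
"fs" agent works internally; the function name mount_responds, the
5-second timeout, and the path /mnt/example are all my own placeholders.

```python
import os
import threading


def mount_responds(path, timeout=5.0):
    """Return True if `path` is a mount point that can be listed in time.

    The probe runs in a daemon thread because I/O against a dead iSCSI
    device may block indefinitely; if the thread has not finished when
    the timeout expires, the mount is reported as unresponsive.
    """
    result = {}

    def probe():
        try:
            if os.path.ismount(path):
                os.listdir(path)  # forces a read against the filesystem
                result["ok"] = True
            else:
                result["ok"] = False
        except OSError:
            result["ok"] = False

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    # No entry in `result` means the probe is still blocked.
    return result.get("ok", False)


if __name__ == "__main__":
    # Hypothetical mount point; substitute the real one.
    print(mount_responds("/mnt/example"))
```

A check like this could be run from cron or a wrapper script on each
node; whether it is safe to act on the result automatically is a
separate question.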

I also found that the rest of the resource group will start even if
iscsid is not running. That is really weird, since everything else in
the resource group depends on the iSCSI filesystem: all of the other
resource agents/scripts are nested (indented) within the "fs" block in
cluster.conf. If I understand correctly, that shouldn't be able to
happen.

I'm going to try a few things:
- Setting "Continuous=no" for all the iSCSI targets in iscsi.conf
  (disables continuous discovery)
- Setting self_fence=1 in cluster.conf
- Setting the recovery policy to "relocate"
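
For reference, a sketch of what the cluster.conf side of those changes
might look like, plus a few lines of Python to confirm that the other
resources really are nested under the "fs" block. The fragment is made
up for illustration: the service, filesystem, and script names, the
device, and the paths are placeholders, not my actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf fragment; every name, device, and path here
# is a placeholder.
CLUSTER_CONF = """\
<rm>
  <service name="example_svc" autostart="1" recovery="relocate">
    <fs name="example_fs" device="/dev/sdX1" mountpoint="/mnt/example"
        fstype="ext3" force_unmount="1" self_fence="1">
      <script name="example_app" file="/etc/init.d/example_app"/>
    </fs>
  </service>
</rm>
"""


def summarize(conf_xml):
    """List each fs resource with its service settings and direct children."""
    root = ET.fromstring(conf_xml)
    rows = []
    for svc in root.iter("service"):
        for fs in svc.iter("fs"):
            rows.append({
                "service": svc.get("name"),
                "recovery": svc.get("recovery"),
                "fs": fs.get("name"),
                "self_fence": fs.get("self_fence"),
                # Resources nested directly inside the fs block.
                "children": [child.get("name") for child in fs],
            })
    return rows


if __name__ == "__main__":
    for row in summarize(CLUSTER_CONF):
        print(row)
```

Running the same summary against the real /etc/cluster/cluster.conf
(the <rm> section sits inside the top-level <cluster> element, which
root.iter() handles) would show whether any resource ended up outside
the "fs" block by mistake.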

Any recommendations from the experts?

I have a support ticket open with Red Hat, but they are still combing
through six nodes' worth of sosreport files.

Cheers

-- 
Charles Riley

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster