failover hangs on open file

danwest@xxxxxxxxxxx · Wed, 30 Nov 2005 15:39:44 +0000

There seems to be a bug that affects service groups when a process outside the cluster's control has open files on a file system that is managed via the cluster.  I am running the RHEL4U1 code release.  An example is described below.

A simple 2-node cluster (nodeA and nodeB) with a Virtual IP resource and an ext3 filesystem resource managed via CLVMD.  I have removed a script resource for simplicity.  My service is started on nodeA, where it has the VIP and the ext3 mount (/mnt/cluster).  I can relocate the service to nodeB with no problem using "clusvcadm -r service -m nodeB".  I can also relocate it back without a problem, but if I open a file on the cluster-managed ext3 mount (vi /mnt/cluster/test) and try to relocate the service, it fails every time.

The behavior of the RHEL3 codebase was to kill all processes associated with the mount on failure and/or relocation.
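For anyone trying to reproduce this, it may help to confirm which processes are actually holding the mount before attempting the relocation. The following is only a rough sketch, not anything rgmanager itself does: it walks /proc and reports processes whose open file descriptors or current working directory point under a given mount point, much like "fuser -m" would. The function name find_holders and the proc_root parameter are made up for illustration.

```python
import os


def find_holders(mount_point, proc_root="/proc"):
    """Return {pid: [paths]} for processes using files under mount_point.

    Hypothetical helper: checks each process's cwd and open file
    descriptors by reading the symlinks under <proc_root>/<pid>/.
    Processes we are not allowed to inspect are silently skipped.
    """
    mount_point = os.path.normpath(mount_point)
    # Trailing separator so /mnt/cluster does not match /mnt/clusterX.
    prefix = mount_point.rstrip(os.sep) + os.sep
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        pid_dir = os.path.join(proc_root, entry)
        fd_dir = os.path.join(pid_dir, "fd")
        links = [os.path.join(pid_dir, "cwd")]
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited or permission denied
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == mount_point or target.startswith(prefix):
                holders.setdefault(int(entry), []).append(target)
    return holders
```

Calling find_holders("/mnt/cluster") as root on nodeA while the file is open in vi should list the editor's PID (most likely via its swap file or working directory rather than the file itself); an empty result would suggest something other than an open file is blocking the unmount.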

Here is the output from /var/log/messages during the relocation error:

Nov 29 12:22:12 nodeA clurgmgrd[8445]: <notice> Stopping service SERVICE1
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <notice> stop on fs "testfs" returned 2 (invalid argument(s))
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <crit> #12: RG SERVICE1 failed to stop; intervention required
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <notice> Service SERVICE1 is failed
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <warning> #70: Attempting to restart service SERVICE1 locally.
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <err> #43: Service SERVICE1 has failed; can not start.
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <alert> #2: Service SERVICE1 returned failure code.  Last Owner: nodeA
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <alert> #4: Administrator intervention required.

Output of clustat after the failed relocation with the open file:

Member Status: Quorate, Group Member

  Member Name                              State      ID
  ------ ----                              -----      --
  NodeB                                    Online     0x0000000000000002

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  SERVICE1             (null)                         failed

Any ideas?

Thanks,
 Dan

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster