failover hangs on open file

There seems to be a bug that affects service groups when a process outside the cluster's control has open files on a file system that is managed by the cluster.  I am running the RHEL4U1 code release.  An example is described below.
 
A simple 2 node cluster (nodeA and nodeB) with a Virtual IP resource and an ext3 filesystem resource managed via CLVMD.  I have removed a script resource for simplicity.  My service is started on nodeA; it has the VIP and the ext3 mount (/mnt/cluster).  I can relocate the service to nodeB with no problem using "clusvcadm -r service -m nodeB".  I can also relocate it back without a problem, but if I open a file on the cluster managed ext3 mount (vi /mnt/cluster/test) and then try to relocate the service, it fails every time.
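 
To check which processes are holding the mount busy before a relocation, a rough sketch along these lines may help.  It assumes a Linux /proc filesystem, and the function name "holders" is made up for illustration; it is not part of rgmanager or any cluster tool.  It only sees processes the calling user is allowed to inspect, so it would normally be run as root.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of processes that have an open file
# or a working directory under the given mount point, by walking /proc.
holders() {
    mnt=$1
    for link in /proc/[0-9]*/cwd /proc/[0-9]*/fd/*; do
        # readlink fails for entries we cannot inspect or that vanished; skip them.
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            "$mnt"|"$mnt"/*)
                pid=${link#/proc/}      # e.g. "1234/fd/5" or "1234/cwd"
                echo "${pid%%/*}"       # keep only the PID
                ;;
        esac
    done | sort -un
}
```

Running "holders /mnt/cluster" on nodeA while the editor is open should list the editor's PID, since it keeps files open under the mount.  "fuser -vm /mnt/cluster" from psmisc reports similar information.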
 
The behavior of the RHEL3 codebase was to kill all processes associated with the mount on failure and/or relocation.
 
Here is the output from /var/log/messages during the relocation error:
 
Nov 29 12:22:12 nodeA clurgmgrd[8445]: <notice> Stopping service SERVICE1
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <notice> stop on fs "testfs" returned 2 (invalid argument(s))
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <crit> #12: RG SERVICE1 failed to stop; intervention required
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <notice> Service SERVICE1 is failed
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <warning> #70: Attempting to restart service SERVICE1 locally.
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <err> #43: Service SERVICE1 has failed; can not start.
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <alert> #2: Service SERVICE1 returned failure code.  Last Owner: nodeA
Nov 29 12:22:16 nodeA clurgmgrd[8445]: <alert> #4: Administrator intervention required.
 
Output of clustat after the relocation with open file:
 
Member Status: Quorate, Group Member
 
  Member Name                              State      ID
  ------ ----                              -----      --
  NodeB                                    Online     0x0000000000000002
 
  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  SERVICE1             (null)                         failed
 
Any ideas?
 
Thanks,
 Dan

--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
