How to set up NFS HA service

Debugging a cluster setup with this software would have been a lot easier with better error messages from the components, but I'm getting there...

I thought I'd just mount my gfs file systems outside the resource manager's control, so they are present all the time, and only use the resource manager to move the IP address over and do the NFS magic. That seems impossible: I couldn't get any exports to happen when I defined them in cluster.conf without a surrounding <fs>/<clusterfs> resource. I could define the exports in /etc/exports instead, but then I would have to keep that file in sync between the nodes. So in the end I put all my gfs file systems into cluster.conf.
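For comparison, the /etc/exports route I dropped would have looked roughly like this (paths and the netgroup are the ones from my cluster.conf below; the scp step is exactly the syncing I wanted to avoid):

# on each node: keep a static exports file and push it to the other node
cat > /etc/exports <<'EOF'
/service/pakke       @nis-hosts(ro,sync)
/service/xusers      @nis-hosts(rw,sync)
/service/iftscratch  @nis-hosts(rw,sync)
EOF
scp /etc/exports server2:/etc/exports    # has to be repeated after every change
exportfs -ra                             # reload the export table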

It almost works. The file systems get mounted and they get exported, but I have some error messages in the log file and the exports take a very long time to appear. Only 2 of the 3 exports I defined seem to show up.
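(By "show up" I mean what the standard tools report on the node running the service, e.g.:

exportfs -v                  # the kernel's actual export table
showmount -e localhost       # what rpc.mountd answers to clients
)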

I'm also a bit puzzled about why the file systems don't get unmounted when I disable all services.

As for file locking:
I copied /etc/init.d/nfslock to /etc/init.d/nfslock-svc and made some changes.
First, I added a little code to let nfslock read a variable STATD_STATEDIR from the config file in /etc/sysconfig and pass it to rpc.statd via the -P (state directory) option. It would be nice if someone who knows the process could get this propagated into upcoming Fedora releases... I then changed nfslock-svc to read a different config file (/etc/sysconfig/nfs-svc), to do 'service nfslock stop' at the top of the start section, and to do 'service nfslock start' at the bottom of the stop section.
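Roughly, the relevant pieces of nfslock-svc look like this (just a sketch, not the whole script; STATD_STATEDIR is the variable I mentioned, STATD_HOSTNAME is only a placeholder name for the -n part, and you may want to check that your rpc.statd supports -P and -n):

#!/bin/bash
# nfslock-svc: statd for the clustered NFS service (sketch)
. /etc/init.d/functions

# read the service-specific config instead of /etc/sysconfig/nfslock
[ -f /etc/sysconfig/nfs-svc ] && . /etc/sysconfig/nfs-svc
[ -n "$STATD_STATEDIR" ] && STATDARG="$STATDARG -P $STATD_STATEDIR"
[ -n "$STATD_HOSTNAME" ] && STATDARG="$STATDARG -n $STATD_HOSTNAME"

start() {
        service nfslock stop            # get the node-local statd out of the way
        echo -n $"Starting NFS statd: "
        daemon rpc.statd $STATDARG
        echo
}

stop() {
        echo -n $"Stopping NFS statd: "
        killproc rpc.statd
        echo
        service nfslock start           # bring the node-local statd back
}

case "$1" in
        start)  start ;;
        stop)   stop ;;
        status) status rpc.statd ;;
        *)      echo $"Usage: $0 {start|stop|status}"; exit 1 ;;
esac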
This enables me to have statd running as e.g. 'server1' on the cluster node until it takes over the nfs service. At takeover, statd gets restarted with its statedir on a cluster file system (so it can take over the lock info belonging to the service) and with the name of the NFS service IP address. Does this sound reasonable? I know I'll lose any locks the cluster node may have held (as NFS client) when it takes over the nfs service, but I cannot see any reason why the cluster node should have nfs locks (or nfs mounts, for that matter) except when doing admin work. I think I could fix even that by copying /var/lib/nfs/statd/sm* into the clustered file system right after the 'service nfslock stop' I put in (something like the snippet below).
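Untested, but the idea is this, right after the 'service nfslock stop' in the start section ($STATD_STATEDIR being the shared state directory from /etc/sysconfig/nfs-svc):

service nfslock stop
# carry the node's own client-side lock state over to the shared statedir
# so those hosts still get reboot notification
cp -a /var/lib/nfs/statd/sm/.     "$STATD_STATEDIR/sm/"
cp -a /var/lib/nfs/statd/sm.bak/. "$STATD_STATEDIR/sm.bak/"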


I have appended part of my messages file and my cluster.conf file. Any help with my NFS export issues will be appreciated.

--
birger

<?xml version="1.0"?>
<cluster name="iftc001" config_version="20">
   <clusternodes>
      <clusternode name="server1">
         <fence>
            <!-- If all else fails, make someone do it manually -->
            <method name="human">
               <device name="last_resort" ipaddr="server1"/>
            </method>
         </fence>
      </clusternode>
      <clusternode name="server2">
         <fence>
            <!-- If all else fails, make someone do it manually -->
            <method name="human">
               <device name="last_resort" ipaddr="server2"/>
            </method>
         </fence>
      </clusternode>
   </clusternodes>
                                                                             
   <fencedevices>
      <fencedevice name="last_resort" agent="fence_manual"/>
   </fencedevices>

   <cman two_node="1" expected_votes="1"/>

<rm>
  <failoverdomains>
    <failoverdomain name="nfsdomain" ordered="0" restricted="1">
      <failoverdomainnode name="server1" priority="1"/>
      <failoverdomainnode name="server2" priority="2"/>
    </failoverdomain>
    <failoverdomain name="smbdomain" ordered="0" restricted="1">
      <failoverdomainnode name="server1" priority="2"/>
      <failoverdomainnode name="server2" priority="1"/>
    </failoverdomain>
  </failoverdomains>

  <resources>
    <clusterfs fstype="gfs" name="cluadmfs" mountpoint="/cluadm" device="/dev/raid5/cluadm" options="acl"/>
    <clusterfs fstype="gfs" name="pakkefs" mountpoint="/service/pakke" device="/dev/raid5/pakke" options="acl"/>
    <clusterfs fstype="gfs" name="xusersfs" mountpoint="/service/xusers" device="/dev/raid5/xusers" options="acl"/>
    <clusterfs fstype="gfs" name="iftscratchfs" mountpoint="/service/iftscratch" device="/dev/raid5/iftscratch" options="acl"/>
    <nfsexport name="NFSexports"/>
    <nfsclient name="nis-hosts" target="@nis-hosts" options="rw,sync"/>
    <nfsclient name="nis-hosts-ro" target="@nis-hosts" options="ro,sync"/>
  </resources>

  <service name="nfssvc" domain="nfsdomain">
    <ip address="X.X.X.X" monitor_link="yes"/>
    <script name="NFS script" file="/etc/init.d/nfs"/>
    <script name="NFS script" file="/etc/init.d/nfslock-svc"/>
    <clusterfs ref="cluadmfs"/>
    <clusterfs ref="pakkefs">
      <nfsexport ref="NFSexports">
        <nfsclient ref="nis-hosts-ro"/>
      </nfsexport>
    </clusterfs>
    <clusterfs ref="xusersfs">
      <nfsexport ref="NFSexports">
        <nfsclient ref="nis-hosts"/>
      </nfsexport>
    </clusterfs>
    <clusterfs ref="iftscratchfs">
      <nfsexport ref="NFSexports">
        <nfsclient ref="nis-hosts"/>
      </nfsexport>
    </clusterfs>
  </service>


  <service name="smbsvc" domain="smbdomain">
    <ip address="X.X.X.X" monitor_link="yes"/>
    <clusterfs ref="cluadmfs"/>
    <clusterfs ref="pakkefs"/>
    <clusterfs ref="xusersfs"/>
    <clusterfs ref="iftscratchfs"/>
  </service>
</rm>

</cluster>
Apr 19 14:42:43 server1 clurgmgrd[7498]: <notice> Starting disabled service nfssvc
Apr 19 14:42:43 server1 kernel: GFS: Trying to join cluster "lock_dlm", "iftc001:cluadm"
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: Joined cluster. Now mounting FS...
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: jid=0: Trying to acquire journal lock...
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: jid=0: Looking at journal...
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: jid=0: Done
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: jid=1: Trying to acquire journal lock...
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: jid=1: Looking at journal...
Apr 19 14:42:45 server1 kernel: GFS: fsid=iftc001:cluadm.0: jid=1: Done
Apr 19 14:42:46 server1 kernel: SELinux: initialized (dev dm-0, type gfs), not configured for labeling
Apr 19 14:42:46 server1 kernel: GFS: Trying to join cluster "lock_dlm", "iftc001:gfs01"
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: Joined cluster. Now mounting FS...
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: jid=0: Trying to acquire journal lock...
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: jid=0: Looking at journal...
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: jid=0: Done
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: jid=1: Trying to acquire journal lock...
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: jid=1: Looking at journal...
Apr 19 14:42:48 server1 kernel: GFS: fsid=iftc001:gfs01.0: jid=1: Done
Apr 19 14:42:48 server1 kernel: SELinux: initialized (dev dm-2, type gfs), not configured for labeling
Apr 19 14:42:48 server1 nfs: rpc.mountd shutdown failed
Apr 19 14:42:48 server1 nfs: nfsd shutdown failed
Apr 19 14:42:48 server1 nfs: rpc.rquotad shutdown failed
Apr 19 14:42:48 server1 nfs: Shutting down NFS services:  succeeded
Apr 19 14:42:48 server1 nfs: Starting NFS services:  succeeded
Apr 19 14:42:48 server1 nfs: rpc.rquotad startup succeeded
Apr 19 14:42:48 server1 nfs: rpc.nfsd startup succeeded
Apr 19 14:42:49 server1 nfs: rpc.mountd startup succeeded
Apr 19 14:42:49 server1 rpcidmapd: rpc.idmapd -SIGHUP succeeded
Apr 19 14:42:51 server1 clurmtabd[12327]: <err> #20: Failed set log level
Apr 19 14:42:51 server1 kernel: GFS: Trying to join cluster "lock_dlm", "iftc001:xusers"
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: Joined cluster. Now mounting FS...
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: jid=0: Trying to acquire journal lock...
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: jid=0: Looking at journal...
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: jid=0: Done
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: jid=1: Trying to acquire journal lock...
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: jid=1: Looking at journal...
Apr 19 14:42:53 server1 kernel: GFS: fsid=iftc001:xusers.0: jid=1: Done
Apr 19 14:42:53 server1 kernel: SELinux: initialized (dev dm-1, type gfs), not configured for labeling
Apr 19 14:42:53 server1 clurmtabd[12426]: <err> #20: Failed set log level
Apr 19 14:42:53 server1 kernel: GFS: Trying to join cluster "lock_dlm", "iftc001:scratch"
Apr 19 14:42:55 server1 kernel: GFS: fsid=iftc001:scratch.0: Joined cluster. Now mounting FS...
Apr 19 14:42:55 server1 kernel: GFS: fsid=iftc001:scratch.0: jid=0: Trying to acquire journal lock...
Apr 19 14:42:55 server1 kernel: GFS: fsid=iftc001:scratch.0: jid=0: Looking at journal...
Apr 19 14:42:56 server1 kernel: GFS: fsid=iftc001:scratch.0: jid=0: Done
Apr 19 14:42:56 server1 kernel: GFS: fsid=iftc001:scratch.0: jid=1: Trying to acquire journal lock...
Apr 19 14:42:56 server1 kernel: GFS: fsid=iftc001:scratch.0: jid=1: Looking at journal...
Apr 19 14:42:56 server1 kernel: GFS: fsid=iftc001:scratch.0: jid=1: Done
Apr 19 14:42:56 server1 kernel: SELinux: initialized (dev dm-3, type gfs), not configured for labeling
Apr 19 14:42:56 server1 clurmtabd[12517]: <err> #20: Failed set log level
Apr 19 14:42:57 server1 nfs: Starting NFS services:  succeeded
Apr 19 14:42:57 server1 nfs: rpc.rquotad startup succeeded
Apr 19 14:42:57 server1 nfs: rpc.nfsd startup succeeded
Apr 19 14:42:57 server1 nfs: rpc.mountd startup succeeded
Apr 19 14:42:57 server1 rpcidmapd: rpc.idmapd -SIGHUP succeeded
Apr 19 14:42:57 server1 nfslock: lockd -KILL succeeded
Apr 19 14:42:57 server1 rpc.statd[12004]: Caught signal 15, un-registering and exiting.
Apr 19 14:42:57 server1 nfslock: rpc.statd shutdown succeeded
Apr 19 14:42:58 server1 rpc.statd[12618]: Version 1.0.6 Starting
Apr 19 14:42:58 server1 rpc.statd[12618]: Flags:
Apr 19 14:42:58 server1 nfslock-svc: rpc.statd startup succeeded
Apr 19 14:42:58 server1 clurgmgrd[7498]: <notice> Service nfssvc started
Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
Apr 19 14:43:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)
Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts-ro" returned 1 (generic error)
Apr 19 14:44:56 server1 clurgmgrd[7498]: <notice> status on nfsclient "nis-hosts" returned 1 (generic error)

