Hello experts,
In preparation for a new production system I have set up a test system
with RHCS under Scientific Linux 5.2.
It consists of two identical FSC/RX200 nodes, a Brocade Fibre Channel
switch, an FSC/SX80 Fibre Channel RAID array, and an APC power switch.
The configuration is attached at the end.
I want to have three GFS filesystems that are
- exported via NFS to a number of clients, each service having its own IP
- backed up via TSM to a TSM server
I see some problems for which I need an explanation or solution:
1) If I connect the NFS clients to the IP of the configured NFS service,
started e.g. on tnode02, the filesystem is mounted, but I see a
strange lock problem:
tnode02 kernel: portmap: server "client-IP" not\
responding, timed out
tnode02 kernel: lockd: server "client-IP" not responding,\
timed out
It goes away if I bind the NFS clients directly to the IP of the
node tnode02. If I start the services on tnode01, the problem is exactly
the same, and it is again solved by binding the clients directly to
tnode01. It does not depend on the firewall configuration; it is the
same if I switch off iptables on both tnode0[12] and the clients.
2) tnode02 died with a kernel panic. I found no really helpful logs
regarding the panic; I only see a lot of messages about problems with
NFS locking over GFS:
kernel: lockd: grant for unknown block
kernel: dlm: dlm_plock_callback: lock granted after lock request failed
before the kernel panicked, but is this a real reason to panic?
At this point tnode01 tried to take over the cluster and to fence
tnode02, which gave an error I do not understand, because fence_apc
run by hand (On, Off, Status) works properly:
tnode01 fenced[3127]: fencing node "tnode02.phy.tu-dresden.de"
tnode01 fenced[3127]: agent "fence_apc" reports: Traceback (most recent
call last): File "/sbin/fence_apc", line 829, in ? main() File
"/sbin/fence_apc", line 349, in main do_power_off(sock) File
"/sbin/fence_apc", line 813, in do_power_off x =
do_power_switch(sock, "off") File "/sbi
tnode01 fenced[3127]: agent "fence_apc" reports: n/fence_apc", line 611,
in do_power_switch result_code, response = power_off(txt + ndbuf)
File "/sbin/fence_apc", line 817, in power_off x =
power_switch(buffer, False, "2", "3"); File "/sbin/fence_apc", line
810, in power_switch raise "un
tnode01 fenced[3127]: agent "fence_apc" reports: known screen
encountered in \n" + str(lines) + "\n" unknown screen encountered in
['', '> 2', '', '', '------- Configure Outlet
------------------------------------------------------', '', ' #
State Ph Name Pwr On Dly Pwr Off D
tnode01 fenced[3127]: agent "fence_apc" reports: ly Reboot Dur.', '
----------------------------------------------------------------------------',
' 2 ON 1 Outlet 2 0 sec 0 sec 5
sec', '', ' 1- Outlet Name : Outlet 2', ' 2- Power On
Delay(sec) :
tnode01 fenced[3127]: agent "fence_apc" reports: 0', ' 3- Power Off
Delay(sec): 0', ' 4- Reboot Duration(sec): 5', ' 5- Accept
Changes : ', '', ' ?- Help, <ESC>- Back, <ENTER>- Refresh,
<CTRL-L>- Event Log']
So tnode01 kept trying to fence tnode02 and was therefore not able to
take over the cluster services. It was also not possible to stop any
service via system-config-cluster. Stopping processes did not really
help. The only solution at this point was to power down both nodes and
restart the cluster.
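Because syslog splits the fence_apc output into several messages, the
traceback above is hard to read. A minimal sketch for gluing the pieces
back together, assuming every chunk carries the same
'fenced[PID]: agent "..." reports: ' prefix as in the lines quoted above
and that chunks are cut mid-word without any separator (the function
name join_agent_output is made up for this example):

```python
import re

# Prefix that precedes every chunk of agent output in the quoted log.
# The pattern is an assumption derived from the lines shown above.
PREFIX = re.compile(r'fenced\[\d+\]: agent "[^"]+" reports: ')

def join_agent_output(lines):
    """Strip the syslog prefix from each line and concatenate the chunks."""
    chunks = []
    for line in lines:
        match = PREFIX.search(line)
        if match:
            # Keep only the text after the prefix; lines without it are skipped.
            chunks.append(line[match.end():])
    return "".join(chunks)

sample = [
    'tnode01 fenced[3127]: agent "fence_apc" reports: do_power_switch(sock, "off") File "/sbi',
    'tnode01 fenced[3127]: agent "fence_apc" reports: n/fence_apc", line 611,',
]
print(join_agent_output(sample))
```

Applied to the complete messages file, this should give the Python
traceback and the APC menu screen as one continuous string.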
So my questions are:
Is there a solution for the locking problem that occurs when one binds
the NFS clients to the configured NFS service IP?
Is there an explanation/solution for the NFS (dlm) GFS locking problem?
Is there a significant update to fence_apc that I have missed?
Why do I have to configure the GFS resources with the "force unmount"
option?
I was under the impression that one can mount GFS filesystems
simultaneously on a number of nodes. If I define the GFS resources
without "force unmount", the filesystem is not mounted at all. But
running the defined TSM service depends on all filesystems being mounted.
Thanks for any help, Rainer
The configuration is
Scientific Linux SL release 5.2 (Boron)
kernel 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 12:51:10 EDT 2009 x86_64
x86_64 x86_64 GNU/Linux
device-mapper-multipath-0.4.7-23.el5_3.2.x86_64
rgmanager-2.0.38-2.el5_2.1.x86_64
system-config-cluster-1.0.52-1.1.noarch
cman-2.0.84-2.el5.x86_64
kmod-gfs-0.1.23-5.el5_2.4.x86_64
gfs2-utils-0.1.44-1.el5.x86_64
gfs-utils-0.1.17-1.el5.x86_64
lvm2-cluster-2.02.32-4.el5.x86_64
modcluster-0.12.0-7.el5.x86_64
ricci-0.12.0-7.el5.x86_64
openais-0.80.3-15.el5.x86_64
cluster.conf
<?xml version="1.0"?>
<cluster alias="tstw_HA2" config_version="115" name="tstw_HA2">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="tnode02.tst.tu-dresden.de" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="HA_APC" port="2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="tnode01.tst.tu-dresden.de" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="HA_APC" port="1"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="192.168.0.10" login="xxx" name="HA_APC" passwd="yy-xxxx"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="HA_new_failover" ordered="1" restricted="1">
        <failoverdomainnode name="tnode01.tst.tu-dresden.de" priority="1"/>
        <failoverdomainnode name="tnode02.tst.tu-dresden.de" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <clusterfs device="/dev/VG1/LV00" force_unmount="1" fsid="53422" fstype="gfs" mountpoint="/global_home" name="home_GFS" options=""/>
      <nfsexport name="home_nfsexport"/>
      <nfsclient name="tstw_home" options="rw,root_squash" path="/global_home" target="tstw*.tst.tu-dresden.de"/>
      <ip address="111.22.33.32" monitor_link="1"/>
      <ip address="192.168.20.30" monitor_link="1"/>
      <nfsclient name="fast_nfs_home_clients" options="rw,root_squash" path="/global_home" target="192.168.20.0/24"/>
      <nfsexport name="cluster_nfsexport"/>
      <nfsclient name="tstw_cluster" options="no_root_squash,ro" path="/global_cluster" target="tstw*.tst.tu-dresden.de"/>
      <nfsclient name="fast_nfs_cluster_clients" options="no_root_squash,ro" path="/global_cluster" target="192.168.20.0/24"/>
      <script file="/etc/rc.d/init.d/tsm" name="TSM_backup"/>
      <clusterfs device="/dev/VG1/LV10" force_unmount="1" fsid="192" fstype="gfs" mountpoint="/global_cluster" name="cluster_GFS" options=""/>
      <clusterfs device="/dev/VG1/LV20" force_unmount="1" fsid="63016" fstype="gfs" mountpoint="/global_soft" name="software_GFS" options=""/>
      <nfsexport name="soft_nfsexport"/>
      <nfsclient name="tstw_soft" options="rw,root_squash" path="/global_soft" target="tstw*.tst.tu-dresden.de"/>
      <nfsclient name="fast_nfs_soft_clients" options="rw,root_squash" path="/global_soft" target="192.168.20.0/24"/>
      <nfsclient name="tsts_home" options="no_root_squash,rw" path="/global_home" target="tsts0*.tst.tu-dresden.de"/>
      <nfsclient name="tsts_cluster" options="rw,root_squash" path="/global_cluster" target="tsts0*.tst.tu-dresden.de"/>
      <nfsclient name="tsts_soft" options="rw,root_squash" path="/global_soft" target="tsts0*.tst.tu-dresden.de"/>
      <nfsclient name="tstf_home" options="rw,root_squash" path="/global_home" target="tstf*.tst.tu-dresden.de"/>
      <nfsclient name="tstf_cluster" options="rw,root_squash" path="/global_cluster" target="tstf*.tst.tu-dresden.de"/>
      <nfsclient name="tstf_soft" options="rw,root_squash" path="/global_soft" target="tstf*.tst.tu-dresden.de"/>
      <ip address="111.22.33.31" monitor_link="1"/>
      <ip address="111.22.33.30" monitor_link="1"/>
      <ip address="192.168.20.31" monitor_link="1"/>
      <ip address="192.168.20.32" monitor_link="1"/>
      <clusterfs device="/dev/VG1/LV20" force_unmount="0" fsid="11728" fstype="gfs" mountpoint="/global_soft" name="Software_GFS" options=""/>
      <clusterfs device="/dev/VG1/LV10" force_unmount="0" fsid="36631" fstype="gfs" mountpoint="/global_cluster" name="Cluster_GFS" options=""/>
      <clusterfs device="/dev/VG1/LV00" force_unmount="0" fsid="45816" fstype="gfs" mountpoint="/global_home" name="Home_GFS" options=""/>
    </resources>
    <service autostart="1" domain="HA_new_failover" name="service_nfs_home">
      <nfsexport ref="home_nfsexport"/>
      <nfsclient ref="tstw_home"/>
      <ip ref="111.22.33.32"/>
      <nfsclient ref="tsts_home"/>
      <nfsclient ref="tstf_home"/>
      <clusterfs ref="home_GFS"/>
    </service>
    <service autostart="1" domain="HA_new_failover" name="service_nfs_home_fast">
      <nfsexport ref="home_nfsexport"/>
      <nfsclient ref="fast_nfs_home_clients"/>
      <ip ref="192.168.20.32"/>
      <clusterfs ref="Home_GFS"/>
    </service>
    <service autostart="1" domain="HA_new_failover" name="service_nfs_cluster">
      <nfsexport ref="cluster_nfsexport"/>
      <nfsclient ref="tstw_cluster"/>
      <nfsclient ref="tsts_cluster"/>
      <nfsclient ref="tstf_cluster"/>
      <ip ref="111.22.33.30"/>
      <clusterfs ref="cluster_GFS"/>
    </service>
    <service autostart="1" name="service_nfs_cluster_fast">
      <nfsexport ref="cluster_nfsexport"/>
      <ip ref="192.168.20.30"/>
      <nfsclient ref="fast_nfs_cluster_clients"/>
      <clusterfs ref="Cluster_GFS"/>
    </service>
    <service autostart="1" domain="HA_new_failover" name="service_TSM">
      <ip ref="111.22.33.31"/>
      <script ref="TSM_backup"/>
      <clusterfs ref="Software_GFS"/>
      <clusterfs ref="Cluster_GFS"/>
      <clusterfs ref="Home_GFS"/>
    </service>
    <service autostart="1" domain="HA_new_failover" name="service_nfs_soft">
      <nfsexport ref="soft_nfsexport"/>
      <nfsclient ref="tstw_soft"/>
      <nfsclient ref="tsts_soft"/>
      <nfsclient ref="tstf_soft"/>
      <ip ref="111.22.33.31"/>
      <clusterfs ref="software_GFS"/>
    </service>
    <service autostart="1" domain="HA_new_failover" name="service_nfs_soft_fast">
      <nfsexport ref="soft_nfsexport"/>
      <nfsclient ref="fast_nfs_soft_clients"/>
      <ip ref="192.168.20.31"/>
      <clusterfs ref="Software_GFS"/>
    </service>
  </rm>
</cluster>
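Several resources in this cluster.conf are referenced by more than one
service; for example, the ip 111.22.33.31 appears in both service_TSM
and service_nfs_soft. I do not know whether that is related to the
problems described above. A minimal sketch for listing such shared
references, using only the Python standard library; the function name
shared_refs and the shortened sample document are made up for this
example:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def shared_refs(conf_xml):
    """Return {(tag, ref): [service names]} for refs used by several services."""
    root = ET.fromstring(conf_xml)
    users = defaultdict(list)
    for service in root.iter("service"):
        for child in service:
            ref = child.get("ref")
            if ref is not None:
                users[(child.tag, ref)].append(service.get("name"))
    # Keep only resources that more than one service refers to.
    return {key: names for key, names in users.items() if len(names) > 1}

# Shortened sample with the same structure as the cluster.conf above.
sample = """
<cluster name="demo">
  <rm>
    <service name="service_TSM">
      <ip ref="111.22.33.31"/>
      <clusterfs ref="Software_GFS"/>
    </service>
    <service name="service_nfs_soft">
      <ip ref="111.22.33.31"/>
      <clusterfs ref="software_GFS"/>
    </service>
  </rm>
</cluster>
""".strip()

for (tag, ref), names in shared_refs(sample).items():
    print(tag, ref, "used by", ", ".join(names))
```

To check the real file, one would pass the contents of
/etc/cluster/cluster.conf (the usual location, assumed here) instead of
the sample string. Note that the comparison is case-sensitive, so
Software_GFS and software_GFS count as different resources, as they do
in the configuration.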