Re: RHCS TestCluster with ScientificLinux 5.2

Hello,

Hmm, meanwhile the fence_apc problem has been fixed by a more recent version of fence_apc.

But the NFS lock problem is still open. Does that mean I definitely should not use Scientific Linux and should switch to Fedora 11 or RHEL 5.4 instead?

Cheers, Rainer



Rainer Schwierz wrote:
Hello experts,

In preparation for a new production system I have set up a test system
with RHCS under Scientific Linux 5.2.
It consists of two identical FSC/RX200 nodes, a Brocade Fibre Channel switch, an FSC/SX80 Fibre Channel RAID array, and an APC power switch.
The configuration is attached at the end.
I want three (GFS) filesystems that are
- exported via NFS to a number of clients, with each service having its own IP
- backed up via TSM to a TSM server

I see some problems for which I need an explanation or a solution:
1) If I connect the NFS clients to the IP of the configured NFS service,
   started e.g. on tnode02, the filesystem is mounted, but I see a
   strange lock problem:
     tnode02 kernel: portmap: server "client-IP" not responding, timed out
     tnode02 kernel: lockd: server "client-IP" not responding, timed out
   It goes away if I bind the NFS clients directly to the IP of the
   node tnode02. If I start the services on tnode01, it is exactly
   the same problem, solved by binding the clients directly to tnode01. It
   does not depend on the firewall configuration; it is the same if I
   switch off iptables on both tnode0[12] and the clients.
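
   Those two messages suggest that lockd/statd on the serving node cannot reach
   the client's portmapper/NLM service for its callbacks. A quick, cluster-independent
   check from the node that currently owns the service IP (plain rpcinfo; "client-IP"
   stands for one of the affected clients) might be:

     rpcinfo -p client-IP            # is the client's portmapper reachable at all?
     rpcinfo -u client-IP nlockmgr   # can we reach the client's lock manager (lockd)?
     rpcinfo -u client-IP status     # can we reach the client's statd?

   If these already time out, the problem is in the network path or e.g. in the
   client's tcp_wrappers rules for portmap (/etc/hosts.allow, /etc/hosts.deny),
   which act independently of iptables, rather than in GFS or rgmanager.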

2) tnode02 died with a kernel panic; I found no really helpful logs regarding
   the panic, I only see a lot of messages about problems with NFS
   locking over GFS:

  kernel: lockd: grant for unknown block
  kernel: dlm: dlm_plock_callback: lock granted after lock request failed

   before the kernel panicked, but is this a real reason to panic?

  At this point tnode01 tried to take over the cluster and to fence
  tnode02, which gave an error I do not understand, because fence_apc
  run by hand (On, Off, Status) works properly:

tnode01 fenced[3127]: fencing node "tnode02.phy.tu-dresden.de"
tnode01 fenced[3127]: agent "fence_apc" reports:
  Traceback (most recent call last):
    File "/sbin/fence_apc", line 829, in ?
      main()
    File "/sbin/fence_apc", line 349, in main
      do_power_off(sock)
    File "/sbin/fence_apc", line 813, in do_power_off
      x = do_power_switch(sock, "off")
    File "/sbin/fence_apc", line 611, in do_power_switch
      result_code, response = power_off(txt + ndbuf)
    File "/sbin/fence_apc", line 817, in power_off
      x = power_switch(buffer, False, "2", "3");
    File "/sbin/fence_apc", line 810, in power_switch
      raise "unknown screen encountered in \n" + str(lines) + "\n"
  unknown screen encountered in
  ['', '> 2', '', '',
   '------- Configure Outlet ------------------------------------------------------', '',
   ' # State Ph Name Pwr On Dly Pwr Off Dly Reboot Dur.',
   ' ----------------------------------------------------------------------------',
   ' 2 ON 1 Outlet 2 0 sec 0 sec 5 sec', '',
   ' 1- Outlet Name : Outlet 2',
   ' 2- Power On Delay(sec) : 0',
   ' 3- Power Off Delay(sec): 0',
   ' 4- Reboot Duration(sec): 5',
   ' 5- Accept Changes : ', '',
   ' ?- Help, <ESC>- Back, <ENTER>- Refresh, <CTRL-L>- Event Log']

  So tnode01 did not stop trying to fence tnode02 and so was not able to take
  over the cluster services. Via system-config-cluster it was also not
  possible to stop any service. Killing processes did not really help. The
  only solution at this point was to power down both nodes and restart
  the cluster.
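
  For completeness: fenced does not call the agent the same way as a by-hand test;
  it feeds the attributes from cluster.conf to the agent on stdin as key=value
  lines. A closer reproduction of the failing call is something like the sketch
  below (the exact stdin key names, e.g. option= versus action=, depend on the
  agent version):

    # by-hand style, as in the tests above
    fence_apc -a 192.168.0.10 -l xxx -p yy-xxxx -n 2 -o Status

    # fenced-style call, feeding the same attributes on stdin
    printf 'ipaddr=192.168.0.10\nlogin=xxx\npasswd=yy-xxxx\nport=2\noption=off\n' | fence_apc

  If that reproduces the "unknown screen encountered" error outside of fenced, it
  becomes much easier to debug the agent's parsing of this particular APC firmware's menu.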

So my questions:

Is there a solution for the locking problem if one binds the NFS clients to the configured NFS service IP?
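
One thing that may be worth trying here, assuming the installed rgmanager version
already supports it, is the nfslock attribute on the service, which enables
rgmanager's NFS lock workarounds for the floating IP (a sketch only, not a
confirmed fix):

        <service autostart="1" domain="HA_new_failover" name="service_nfs_home" nfslock="1">
                ...
        </service>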

Is there an explanation/solution for the NFS (DLM) GFS locking problem?

Is there a significant update to fence_apc that I have missed?

Why do I have to configure the GFS resources with the "force unmount" option?
  I was under the impression that one can mount GFS filesystems
  simultaneously on a number of nodes. If I define the GFS resources
  without "force unmount", the filesystem is not mounted at all. But
  running the defined TSM service depends on all filesystems being mounted.
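
  Since GFS can indeed be mounted on all nodes at the same time, would it be
  cleaner to take these mounts out of rgmanager altogether and mount them on
  both nodes via /etc/fstab and the gfs init script? For example (a sketch using
  the devices and mount points from the configuration below):

    # /etc/fstab on both tnode01 and tnode02
    /dev/VG1/LV00  /global_home     gfs  defaults  0 0
    /dev/VG1/LV10  /global_cluster  gfs  defaults  0 0
    /dev/VG1/LV20  /global_soft     gfs  defaults  0 0

    # make sure the cluster filesystem stack starts at boot
    chkconfig cman on; chkconfig clvmd on; chkconfig gfs on

  The services would then only use paths that are already mounted on both nodes.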

Thanks for any help,  Rainer

The configuration is
Scientific Linux SL release 5.2 (Boron)
kernel 2.6.18-128.4.1.el5 #1 SMP Tue Aug 4 12:51:10 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
device-mapper-multipath-0.4.7-23.el5_3.2.x86_64
rgmanager-2.0.38-2.el5_2.1.x86_64
system-config-cluster-1.0.52-1.1.noarch
cman-2.0.84-2.el5.x86_64
kmod-gfs-0.1.23-5.el5_2.4.x86_64
gfs2-utils-0.1.44-1.el5.x86_64
gfs-utils-0.1.17-1.el5.x86_64
lvm2-cluster-2.02.32-4.el5.x86_64
modcluster-0.12.0-7.el5.x86_64
ricci-0.12.0-7.el5.x86_64
openais-0.80.3-15.el5.x86_64

cluster.conf
<?xml version="1.0"?>
<cluster alias="tstw_HA2" config_version="115" name="tstw_HA2">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
<clusternode name="tnode02.tst.tu-dresden.de" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="HA_APC" port="2"/>
                                </method>
                        </fence>
                </clusternode>
<clusternode name="tnode01.tst.tu-dresden.de" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="HA_APC" port="1"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.0.10" login="xxx" name="HA_APC" passwd="yy-xxxx"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="HA_new_failover" ordered="1" restricted="1">
                                <failoverdomainnode name="tnode01.tst.tu-dresden.de" priority="1"/>
                                <failoverdomainnode name="tnode02.tst.tu-dresden.de" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
<clusterfs device="/dev/VG1/LV00" force_unmount="1" fsid="53422" fstype="gfs" mountpoint="/global_home" name="home_GFS" options=""/>
                        <nfsexport name="home_nfsexport"/>
<nfsclient name="tstw_home" options="rw,root_squash" path="/global_home" target="tstw*.tst.tu-dresden.de"/>
                        <ip address="111.22.33.32" monitor_link="1"/>
                        <ip address="192.168.20.30" monitor_link="1"/>
<nfsclient name="fast_nfs_home_clients" options="rw,root_squash" path="/global_home" target="192.168.20.0/24"/>
                        <nfsexport name="cluster_nfsexport"/>
<nfsclient name="tstw_cluster" options="no_root_squash,ro" path="/global_cluster" target="tstw*.tst.tu-dresden.de"/> <nfsclient name="fast_nfs_cluster_clients" options="no_root_squash,ro" path="/global_cluster" target="192.168.20.0/24"/> <script file="/etc/rc.d/init.d/tsm" name="TSM_backup"/> <clusterfs device="/dev/VG1/LV10" force_unmount="1" fsid="192" fstype="gfs" mountpoint="/global_cluster" name="cluster_GFS" options=""/> <clusterfs device="/dev/VG1/LV20" force_unmount="1" fsid="63016" fstype="gfs" mountpoint="/global_soft" name="software_GFS" options=""/>
                        <nfsexport name="soft_nfsexport"/>
<nfsclient name="tstw_soft" options="rw,root_squash" path="/global_soft" target="tstw*.tst.tu-dresden.de"/> <nfsclient name="fast_nfs_soft_clients" options="rw,root_squash" path="/global_soft" target="192.168.20.0/24"/> <nfsclient name="tsts_home" options="no_root_squash,rw" path="/global_home" target="tsts0*.tst.tu-dresden.de"/> <nfsclient name="tsts_cluster" options="rw,root_squash" path="/global_cluster" target="tsts0*.tst.tu-dresden.de"/> <nfsclient name="tsts_soft" options="rw,root_squash" path="/global_soft" target="tsts0*.tst.tu-dresden.de"/> <nfsclient name="tstf_home" options="rw,root_squash" path="/global_home" target="tstf*.tst.tu-dresden.de"/> <nfsclient name="tstf_cluster" options="rw,root_squash" path="/global_cluster" target="tstf*.tst.tu-dresden.de"/> <nfsclient name="tstf_soft" options="rw,root_squash" path="/global_soft" target="tstf*.tst.tu-dresden.de"/>
                        <ip address="111.22.33.31" monitor_link="1"/>
                        <ip address="111.22.33.30" monitor_link="1"/>
                        <ip address="192.168.20.31" monitor_link="1"/>
                        <ip address="192.168.20.32" monitor_link="1"/>
<clusterfs device="/dev/VG1/LV20" force_unmount="0" fsid="11728" fstype="gfs" mountpoint="/global_soft" name="Software_GFS" options=""/> <clusterfs device="/dev/VG1/LV10" force_unmount="0" fsid="36631" fstype="gfs" mountpoint="/global_cluster" name="Cluster_GFS" options=""/> <clusterfs device="/dev/VG1/LV00" force_unmount="0" fsid="45816" fstype="gfs" mountpoint="/global_home" name="Home_GFS" options=""/>
                </resources>
                <service autostart="1" domain="HA_new_failover" name="service_nfs_home">
                        <nfsexport ref="home_nfsexport"/>
                        <nfsclient ref="tstw_home"/>
                        <ip ref="111.22.33.32"/>
                        <nfsclient ref="tsts_home"/>
                        <nfsclient ref="tstf_home"/>
                        <clusterfs ref="home_GFS"/>
                </service>
                <service autostart="1" domain="HA_new_failover" name="service_nfs_home_fast">
                        <nfsexport ref="home_nfsexport"/>
                        <nfsclient ref="fast_nfs_home_clients"/>
                        <ip ref="192.168.20.32"/>
                        <clusterfs ref="Home_GFS"/>
                </service>
                <service autostart="1" domain="HA_new_failover" name="service_nfs_cluster">
                        <nfsexport ref="cluster_nfsexport"/>
                        <nfsclient ref="tstw_cluster"/>
                        <nfsclient ref="tsts_cluster"/>
                        <nfsclient ref="tstf_cluster"/>
                        <ip ref="111.22.33.30"/>
                        <clusterfs ref="cluster_GFS"/>
                </service>
                <service autostart="1" name="service_nfs_cluster_fast">
                        <nfsexport ref="cluster_nfsexport"/>
                        <ip ref="192.168.20.30"/>
                        <nfsclient ref="fast_nfs_cluster_clients"/>
                        <clusterfs ref="Cluster_GFS"/>
                </service>
                <service autostart="1" domain="HA_new_failover" name="service_TSM">
                        <ip ref="111.22.33.31"/>
                        <script ref="TSM_backup"/>
                        <clusterfs ref="Software_GFS"/>
                        <clusterfs ref="Cluster_GFS"/>
                        <clusterfs ref="Home_GFS"/>
                </service>
                <service autostart="1" domain="HA_new_failover" name="service_nfs_soft">
                        <nfsexport ref="soft_nfsexport"/>
                        <nfsclient ref="tstw_soft"/>
                        <nfsclient ref="tsts_soft"/>
                        <nfsclient ref="tstf_soft"/>
                        <ip ref="111.22.33.31"/>
                        <clusterfs ref="software_GFS"/>
                </service>
                <service autostart="1" domain="HA_new_failover" name="service_nfs_soft_fast">
                        <nfsexport ref="soft_nfsexport"/>
                        <nfsclient ref="fast_nfs_soft_clients"/>
                        <ip ref="192.168.20.31"/>
                        <clusterfs ref="Software_GFS"/>
                </service>
        </rm>
</cluster>



--
| R.Schwierz@xxxxxxxxxxxxxxxxxxxx                     |
| Rainer  Schwierz, Inst. f. Kern- und Teilchenphysik |
| TU Dresden,       D-01062 Dresden                   |
| Tel. ++49 351 463 32957    FAX ++49 351 463 37292   |
| http://iktp.tu-dresden.de/~schwierz/                |

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
