Re: pacemaker VIP routing latency to gluster node.

On 09/23/2016 10:44 PM, Dung Le wrote:
Hi Soumya,

Did you check the 'pcs status' output at that time? Maybe the *-ClusterIP*
resources had gone into the Stopped state, making the VIPs unavailable.

Yes, I did check 'pcs status', and everything looked good at the time.

I hit the issue again yesterday, with both the VIP mount and the hung df output.

On client 1, the df output hung. I also could NOT mount the gluster
volume via VIP x.x.x.001, but I could mount it via VIPs x.x.x.002 and
x.x.x.003.
On client 2, I could mount the gluster volume via all three VIPs:
x.x.x.001, x.x.x.002, and x.x.x.003.

So that means only the traffic/connection between client 1 and VIP1 is
affected. One possibility I can think of is that the outstanding requests
from that client reached the throttle limit (16) and the server stopped
processing further requests. Could you take a tcpdump on both the client
and the server and observe the traffic? Also, please check the netstat
output on the server:

# netstat -ntau | grep VIP1
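For the capture, something along these lines should be enough (the
interface name, capture file, and NFS port 2049 filter are my
assumptions; adjust for your environment):

# tcpdump -i eth0 -s 0 -w /tmp/vip1-nfs.pcap host VIP1 and port 2049

If the throttle is indeed the culprit, the limit of 16 mentioned above
presumably corresponds to the gluster NFS RPC throttle
(nfs.outstanding-rpc-limit, default 16). If so, it could be raised as a
test, e.g.:

# gluster volume set <volname> nfs.outstanding-rpc-limit 32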

Thanks,
Soumya


Since the pacemaker VIP x.x.x.001 is configured for SN1, I went ahead
and stopped the pcs service on SN1 with 'pcs cluster stop'. The VIP
x.x.x.001 failed over to SN2 as configured, and afterward I could mount
the gluster volume via VIP x.x.x.001 on client 1.
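
Roughly, the steps were as follows (the status check afterward is just
my assumed way of confirming where the VIP landed):

On SN1:
# pcs cluster stop

Then on SN2 or SN3, confirm SN1-ClusterIP is now Started on SN2:
# pcs status resources | grep ClusterIP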

Any idea?

Thanks,
~ Vic Le

On Sep 23, 2016, at 1:33 AM, Soumya Koduri <skoduri@xxxxxxxxxx> wrote:



On 09/23/2016 02:34 AM, Dung Le wrote:
Hello,

I have a pretty straightforward configuration, as below:

3 storage nodes running GlusterFS 3.7.11 with replica 3, using native
gluster NFS.
corosync version 1.4.7 and pacemaker version 1.1.12.
DNS round-robin across the 3 VIPs, one living on each of the 3 storage
nodes.
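
For illustration only, the round-robin DNS would be something like the
following A records (the record name here is a placeholder; only the
three VIPs are taken from the setup below):

nfs-vip    IN  A    x.x.x.001
nfs-vip    IN  A    x.x.x.002
nfs-vip    IN  A    x.x.x.003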

*_Here is how I configured my corosync:_*

SN1 with x.x.x.001
SN2 with x.x.x.002
SN3 with x.x.x.003


******************************************************************************************************************
*_Below is pcs config output:_*

Cluster Name: dfs_cluster
Corosync Nodes:
SN1 SN2 SN3
Pacemaker Nodes:
SN1 SN2 SN3

Resources:
Clone: Gluster-clone
 Meta Attrs: clone-max=3 clone-node-max=3 globally-unique=false
 Resource: Gluster (class=ocf provider=glusterfs type=glusterd)
  Operations: start interval=0s timeout=20 (Gluster-start-interval-0s)
              stop interval=0s timeout=20 (Gluster-stop-interval-0s)
              monitor interval=10s (Gluster-monitor-interval-10s)
Resource: SN1-ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
 Attributes: ip=x.x.x.001 cidr_netmask=32
 Operations: start interval=0s timeout=20s
(SN1-ClusterIP-start-interval-0s)
             stop interval=0s timeout=20s
(SN1-ClusterIP-stop-interval-0s)
             monitor interval=10s (SN1-ClusterIP-monitor-interval-10s)
Resource: SN2-ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
 Attributes: ip=x.x.x.002 cidr_netmask=32
 Operations: start interval=0s timeout=20s
(SN2-ClusterIP-start-interval-0s)
             stop interval=0s timeout=20s
(SN2-ClusterIP-stop-interval-0s)
             monitor interval=10s (SN2-ClusterIP-monitor-interval-10s)
Resource: SN3-ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
 Attributes: ip=x.x.x.003 cidr_netmask=32
 Operations: start interval=0s timeout=20s
(SN3-ClusterIP-start-interval-0s)
             stop interval=0s timeout=20s
(SN3-ClusterIP-stop-interval-0s)
             monitor interval=10s (SN3-ClusterIP-monitor-interval-10s)

Stonith Devices:
Fencing Levels:

Location Constraints:
 Resource: SN1-ClusterIP
   Enabled on: SN1 (score:3000) (id:location-SN1-ClusterIP-SN1-3000)
   Enabled on: SN2 (score:2000) (id:location-SN1-ClusterIP-SN2-2000)
   Enabled on: SN3 (score:1000) (id:location-SN1-ClusterIP-SN3-1000)
 Resource: SN2-ClusterIP
   Enabled on: SN2 (score:3000) (id:location-SN2-ClusterIP-SN2-3000)
   Enabled on: SN3 (score:2000) (id:location-SN2-ClusterIP-SN3-2000)
   Enabled on: SN1 (score:1000) (id:location-SN2-ClusterIP-SN1-1000)
 Resource: SN3-ClusterIP
   Enabled on: SN3 (score:3000) (id:location-SN3-ClusterIP-SN3-3000)
   Enabled on: SN1 (score:2000) (id:location-SN3-ClusterIP-SN1-2000)
   Enabled on: SN2 (score:1000) (id:location-SN3-ClusterIP-SN2-1000)
Ordering Constraints:
 start Gluster-clone then start SN1-ClusterIP (kind:Mandatory)
(id:order-Gluster-clone-SN1-ClusterIP-mandatory)
 start Gluster-clone then start SN2-ClusterIP (kind:Mandatory)
(id:order-Gluster-clone-SN2-ClusterIP-mandatory)
 start Gluster-clone then start SN3-ClusterIP (kind:Mandatory)
(id:order-Gluster-clone-SN3-ClusterIP-mandatory)
Colocation Constraints:

Resources Defaults:
is-managed: true
target-role: Started
requires: nothing
multiple-active: stop_start
Operations Defaults:
No defaults set

Cluster Properties:
cluster-infrastructure: cman
dc-version: 1.1.11-97629de
no-quorum-policy: ignore
stonith-enabled: false

******************************************************************************************************************
*_pcs status output:_*

Cluster name: dfs_cluster
Last updated: Thu Sep 22 16:57:35 2016
Last change: Mon Aug 29 18:02:44 2016
Stack: cman
Current DC: SN1 - partition with quorum
Version: 1.1.11-97629de
3 Nodes configured
6 Resources configured


Online: [ SN1 SN2 SN3 ]

Full list of resources:

Clone Set: Gluster-clone [Gluster]
    Started: [ SN1 SN2 SN3 ]
SN1-ClusterIP (ocf::heartbeat:IPaddr2): Started SN1
SN2-ClusterIP (ocf::heartbeat:IPaddr2): Started SN2
SN3-ClusterIP (ocf::heartbeat:IPaddr2): Started SN3

******************************************************************************************************************


When I mount the gluster volume, I use the VIP name; DNS round-robin
picks one of the storage nodes to establish the NFS connection.
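
For reference, the mount is a plain NFSv3 mount against that name (the
server name nfsserver001 is the one seen in the client logs below; the
volume name and mount point are placeholders):

# mount -t nfs -o vers=3 nfsserver001:/gvol0 /mnt/gvol0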

*_My issue is:_*

After the gluster volume has been mounted for 1-2 hours, all the
clients report that df hangs with no output. I checked the dmesg log
on the client side and got the following errors:

Sep 20 05:46:45 xxxxx kernel: nfs: server nfsserver001 not responding,
still trying
Sep 20 05:49:45 xxxxx kernel: nfs: server nfsserver001 not responding,
still trying

I did try to mount the gluster volume via the DNS round-robin name onto
a different mountpoint, but the mount did not succeed.

Did you check the 'pcs status' output at that time? Maybe the *-ClusterIP*
resources had gone into the Stopped state, making the VIPs unavailable.

Thanks,
Soumya

Then I tried to mount the gluster volume using a storage node's own IP
(not a VIP), and the mount succeeded. Afterward, I switched all the
clients to mount the storage node IPs directly, and they have been up
for more than 12 hours without any issue.
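
For comparison, the direct mount was of the form (again, the volume
name and mount point are placeholders):

# mount -t nfs -o vers=3 <storage-node-IP>:/gvol0 /mnt/gvol0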

Any idea what might cause this issue?

Thanks a lot,

~ Vic Le


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users



