Re: BUG: After stop and start wrong port is advertised

We have a fix in the release-3.10 branch; it is merged and should be available in the next 3.10 update.

On Wed, Nov 8, 2017 at 4:58 PM, Mike Hulsman <mike.hulsman@xxxxxxxx> wrote:
Hi,

This bug is hitting me hard on two different clients: on RHGS 3.3 and on glusterfs 3.10.2 on CentOS 7.4.
In one case I had 59 differences in a total of 203 bricks.

I wrote a quick and dirty script to check all ports against the brick file and the running process.
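#!/bin/bash
# Compare, for every local brick, the listen-port recorded in glusterd's
# brick file with the listen-port the running glusterfsd was started with.

# Short hostname, as used at the start of the brick file names.
Host=$(uname -n | awk -F"." '{print $1}')
# Volume names, derived from the last argument of each running glusterfsd.
GlusterVol=$(ps -eaf | grep /usr/sbin/glusterfsd | grep -v grep | awk '{print $NF}' | awk -F"-server" '{print $1}' | sort | uniq)

for Volume in ${GlusterVol}; do
    cd "/var/lib/glusterd/vols/${Volume}/bricks" || continue
    Bricks=$(ls "${Host}"*)
    for Brick in ${Bricks}; do
        # listen-port line as stored on disk in the brick file
        Onfile=$(grep ^listen-port "${Brick}")
        # Part of the brick file name after "<host>:", without its leading character;
        # the script assumes this also appears in the brick's pid file name.
        BrickDir=$(echo "${Brick}" | awk -F":" '{print $2}' | cut -c2-)
        # listen-port the running daemon was started with (from its last argument)
        Daemon=$(ps -eaf | grep -- "${BrickDir}.pid" | grep -v grep | awk '{print $NF}' | awk -F"." '{print $2}')
        #echo Onfile: ${Onfile}
        #echo Daemon: ${Daemon}
        if [ "${Onfile}" = "${Daemon}" ]; then
            echo "OK For ${Brick}"
        else
            echo "!!! NOT OK For ${Brick}"
        fi
    done
done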


Kind regards,

Mike Hulsman

Proxy Managed Services B.V. | www.proxy.nl | Enterprise IT-Infra, Open Source and Cloud Technology
Delftweg 128 3043 NB Rotterdam The Netherlands | +31 10 307 0965


From: "Jo Goossens" <jo.goossens@xxxxxxxxxxxxxxxx>
To: "Atin Mukherjee" <amukherj@xxxxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx
Sent: Friday, October 27, 2017 11:06:35 PM
Subject: Re: [Gluster-users] BUG: After stop and start wrong port is advertised

Hello Atin,

I just read it and am very happy you found the issue. We really hope this will be fixed in the next 3.10.7 version!

PS: Wow, nice, all that C code and those "goto out" statements (not always considered clean, but often the best way, I think). I can remember the days I wrote kernel drivers in C myself :)

Regards

Jo Goossens

-----Original message-----
From: Atin Mukherjee <amukherj@xxxxxxxxxx>
Sent: Fri 27-10-2017 21:01
Subject: Re: BUG: After stop and start wrong port is advertised
To: Jo Goossens <jo.goossens@xxxxxxxxxxxxxxxx>;
CC: gluster-users@xxxxxxxxxxx;
We (finally) figured out the root cause, Jo!
 
Patch https://review.gluster.org/#/c/18579 posted upstream for review.

On Thu, Sep 21, 2017 at 2:08 PM, Jo Goossens <jo.goossens@xxxxxxxxxxxxxxxx> wrote:

Hi,

We use glusterfs 3.10.5 on Debian 9.

When we stop or restart the service, e.g.: service glusterfs-server restart

We see that the wrong port gets advertised afterwards. For example:

Before restart:

Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
Brick 192.168.140.43:/gluster/public        49152     0          Y       5913
Self-heal Daemon on localhost               N/A       N/A        Y       5932
Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       13084
Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       15499
 
Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks
 
 
After restart of the service on one of the nodes (192.168.140.43) the port seems to have changed (but it didn't):
 
root@app3:/var/log/glusterfs#  gluster volume status
Status of volume: public
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 192.168.140.41:/gluster/public        49153     0          Y       6364
Brick 192.168.140.42:/gluster/public        49152     0          Y       1483
Brick 192.168.140.43:/gluster/public        49154     0          Y       5913
Self-heal Daemon on localhost               N/A       N/A        Y       4628
Self-heal Daemon on 192.168.140.42          N/A       N/A        Y       3077
Self-heal Daemon on 192.168.140.41          N/A       N/A        Y       28777
 
Task Status of Volume public
------------------------------------------------------------------------------
There are no active volume tasks
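
The two status listings above can also be compared mechanically. What follows is only a minimal sketch, assuming the `gluster volume status` column layout shown in this thread (Brick, brick name, TCP port, RDMA port, Online, Pid); `port_changed_same_pid` is a made-up helper name, not a gluster command:

```shell
# Hypothetical helper: report bricks whose advertised TCP port differs
# between two saved "gluster volume status" outputs although the Pid
# column is the same in both.
# $1 = file with the status output before the restart
# $2 = file with the status output after the restart
port_changed_same_pid() {
    awk '
        # First file: remember port and pid per brick.
        FNR == NR {
            if ($1 == "Brick") { port[$2] = $3; pid[$2] = $6 }
            next
        }
        # Second file: print bricks with a new port but an unchanged pid.
        $1 == "Brick" && ($2 in port) && port[$2] != $3 && pid[$2] == $6 {
            print $2 ": " port[$2] " -> " $3 " (pid " $6 " unchanged)"
        }
    ' "$1" "$2"
}

# Possible use (file names are placeholders):
# gluster volume status public > /tmp/status.before
# service glusterfs-server restart
# gluster volume status public > /tmp/status.after
# port_changed_same_pid /tmp/status.before /tmp/status.after
```

A changed port next to an unchanged pid is exactly the symptom reported here: the brick process was never restarted, so it cannot have moved to another port.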
 
 
However, the active process is STILL the same pid AND is still listening on the old port:
 
root@192.168.140.43:/var/log/glusterfs# netstat -tapn | grep gluster
tcp        0      0 0.0.0.0:49152           0.0.0.0:*               LISTEN      5913/glusterfsd
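
The same check can be scripted on the affected node. This is a rough sketch, assuming the `gluster volume status` and `netstat -tapn` output formats shown in this thread; the function names `advertised_port` and `listening_ports` are invented for illustration:

```shell
# Hypothetical helpers; they only parse text in the formats shown above.

# Print the TCP port that "gluster volume status" advertises for one brick.
# $1    = brick as shown in the status output,
#         e.g. 192.168.140.43:/gluster/public
# stdin = output of "gluster volume status"
advertised_port() {
    awk -v brick="$1" '$1 == "Brick" && $2 == brick { print $3 }'
}

# Print the TCP ports that glusterfsd processes are listening on.
# stdin = output of "netstat -tapn"
listening_ports() {
    awk '$6 == "LISTEN" && $7 ~ /\/glusterfsd$/ {
             n = split($4, a, ":")   # local address is <ip>:<port>
             print a[n]
         }'
}

# Possible use on a live node (left commented out):
# adv=$(gluster volume status public | advertised_port "192.168.140.43:/gluster/public")
# if netstat -tapn | listening_ports | grep -qx "$adv"; then
#     echo "OK: advertised port $adv has a glusterfsd listener"
# else
#     echo "MISMATCH: nothing is listening on advertised port $adv"
# fi
```

With the outputs in this report, the advertised port would be 49154 while the only glusterfsd listener is on 49152, so the check would report a mismatch.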
 
 
The other nodes' logs fill up with errors because they can't reach the daemon anymore. They try to reach it on the "new" port instead of the old one:
 
[2017-09-21 08:33:25.225006] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:29.226633] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:29.227490] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:33.225849] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:33.236395] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:37.225095] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:37.225628] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
[2017-09-21 08:33:41.225805] I [rpc-clnt.c:2000:rpc_clnt_reconfig] 0-public-client-2: changing port to 49154 (from 0)
[2017-09-21 08:33:41.226440] E [socket.c:2327:socket_connect_finish] 0-public-client-2: connection to 192.168.140.43:49154 failed (Connection refused); disconnecting socket
 
So they now try 49154 instead of the old 49152.
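
To see at a glance which endpoint the clients keep failing on, log lines like the ones above can be summarised. A small sketch, assuming the `socket_connect_finish` message format shown in the excerpt; `refused_endpoints` is a hypothetical helper name:

```shell
# Hypothetical helper: count "Connection refused" errors per endpoint.
# stdin  = a glusterfs client log
# stdout = one line per endpoint: "<count> <ip>:<port>"
refused_endpoints() {
    grep 'socket_connect_finish' \
        | grep 'Connection refused' \
        | sed -n 's/.*connection to \([^ ]*\) failed.*/\1/p' \
        | sort | uniq -c | awk '{ print $1, $2 }'
}

# Possible use (the log path is a placeholder):
# refused_endpoints < /var/log/glusterfs/some-client.log
```

If every refused connection points at the same ip:port and that port differs from what netstat shows on the target node, the clients were handed a stale or wrong port.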
 
Is this also by design? We had a lot of issues because of this recently. We don't understand why it starts advertising a completely wrong port after stop/start.

Regards

Jo Goossens

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users