Apologies for coming back late on this; I was occupied with some other high-priority work. I think your analysis is spot on.
To add to why we hit this race: when glusterd restarts, glusterd_svcs_manager() is invoked both from glusterd_compare_friend_data() and through glusterd_restart_bricks(). In the latter case the function runs in a synctask, which means multiple threads can end up executing glusterd_restart_bricks() at the same time. With that parallelism it is quite possible that one thread tries to kill the process while another immediately tries to bring it up, and since it is the glusterfsd process itself that cleans up the pidfile, this race is exposed.
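To make the interleaving concrete, here is a stripped-down sketch (illustrative only, not the actual glusterd code; pidfile_exists() and spawn_daemon() are hypothetical helpers). When the stop and start paths run on different threads and liveness is judged purely by the pidfile, this ordering leaves the daemon down while the pidfile says it is up:

/* Stripped-down sketch of the race -- illustrative only, not glusterd code. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int
pidfile_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Thread A (one concurrent caller): stop the daemon.  The daemon normally
 * unlinks its own pidfile while handling SIGTERM, but if we escalate to
 * SIGKILL before that happens, the pidfile is left behind for good. */
void
stop_path(pid_t pid, const char *pidfile)
{
    kill(pid, SIGTERM);
    sleep(1);                       /* grace period */
    if (pidfile_exists(pidfile))    /* cleanup not finished yet? */
        kill(pid, SIGKILL);         /* now nobody will ever remove the file */
}

/* Thread B (another concurrent caller): start the daemon.  Liveness is
 * judged purely by the presence of the pidfile, so the stale file left
 * behind by thread A makes it skip the start entirely. */
void
start_path(const char *pidfile)
{
    if (pidfile_exists(pidfile)) {
        printf("found svc already running, nothing to do\n");
        return;                     /* => no daemon is actually running */
    }
    /* spawn_daemon(pidfile); */
}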
The good news is that I have a patch (mainly for one of the brick multiplexing bugs), https://review.gluster.org/#/c/19357, which actually addresses this problem. I would appreciate it if you could take a look at it, apply the change, and retest.
On Wed, Jan 24, 2018 at 8:39 AM, Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou@xxxxxxxxxxxxxxx> wrote:
Hi,

When we were doing glusterfs tests recently, we occasionally met the following issue: sometimes, after rebooting the sn node, the glustershd process does not come up.

Status of volume: export
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sn-0.local:/mnt/bricks/export/brick   49154     0          Y       1157
Brick sn-1.local:/mnt/bricks/export/brick   49152     0          Y       1123
Self-heal Daemon on localhost               N/A       N/A        Y       1642
Self-heal Daemon on sn-2.local              N/A       N/A        N       N/A
Self-heal Daemon on sn-1.local              N/A       N/A        Y       1877

Task Status of Volume export
------------------------------------------------------------------------------
There are no active volume tasks

[root@sn-0:/root]

When I check glusterd.log on sn-2 I find the following:

[2018-01-23 10:11:29.534588] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.542764] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sn-1.local (0), ret: 0, op_ret: 0
[2018-01-23 10:11:29.561677] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.561719] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-01-23 10:11:29.570655] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.570686] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-01-23 10:11:29.579589] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad, host: sn-1.local, port: 0
[2018-01-23 10:11:29.588190] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.588258] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-01-23 10:11:29.588280] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-01-23 10:11:29.588291] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-01-23 10:11:30.091487] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 1101
[2018-01-23 10:11:30.092964] W [socket.c:593:__socket_rwv] 0-glustershd: readv on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket failed (No data available)
[2018-01-23 10:11:31.091657] I [MSGID: 106568] [glusterd-proc-mgmt.c:110:glusterd_proc_stop] 0-management: second time kill 1101    // this log is added by me
[2018-01-23 10:11:31.091711] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-01-23 10:11:31.097696] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:31.591881] I [MSGID: 106567] [glusterd-svc-mgmt.c:162:glusterd_svc_start] 0-management: found svc glustershd is already running
[2018-01-23 10:11:35.592332] W [socket.c:3314:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket, (No such file or directory)
[2018-01-23 10:11:35.592757] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-01-23 10:11:35.592818] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-01-23 10:11:35.593163] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-01-23 10:11:35.593189] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped

From glustershd.log:

[2018-01-23 10:11:29.535459] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mstate-client-1: disconnected from mstate-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-01-23 10:11:29.535463] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-services-client-1: disconnected from services-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-01-23 10:11:30.091763] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x74a5) [0x7f318d7364a5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdc) [0x40a513] -->/usr/sbin/glusterfs(cleanup_and_exit+0x77) [0x4089ec] ) 0-: received signum (15), shutting down

So, the story seems to go like this:
- glusterd stops glustershd by sending SIGTERM
- this does not work; at least from glusterd's side, the glustershd pid file is still there, so it thinks glustershd is still alive and tries again by sending SIGKILL
- glusterd then tries to start glustershd; however, it finds the glustershd pid file still exists, so it thinks there is no need to start the glustershd process again
- we end up with no glustershd process
So should we also remove the pid file when sending SIGKILL? Or should we remove the pid file in glusterd_svc_stop when glusterd_proc_stop returns 0? If I am not making this clear, you are welcome to discuss the issue with me; looking forward to your reply!
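Something along these lines is what I have in mind for the second option (just a rough sketch to show the idea; the names and the proc_stop() signature are illustrative, not the real glusterd ones):

/* Rough sketch of the idea only -- not the actual glusterd code. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for glusterd_proc_stop(): assume it returns 0 once the daemon
 * is confirmed dead (possibly after escalating SIGTERM to SIGKILL). */
extern int proc_stop(const char *pidfile);

int
svc_stop(const char *pidfile)
{
    int ret = proc_stop(pidfile);

    if (ret == 0) {
        /* The daemon is gone and can no longer unlink its own pidfile,
         * so remove it here; otherwise a later start sees the stale file
         * and wrongly concludes the service is still running. */
        if (unlink(pidfile) != 0 && errno != ENOENT)
            fprintf(stderr, "failed to remove stale pidfile %s\n", pidfile);
    }
    return ret;
}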
Best regards,
Cynthia (周琳)
MBB SM HETRAN SW3 MATRIX Storage
Mobile: +86 (0)18657188311