Re: sometimes when rebooting a node the glustershd process does not come up

Apologies for coming back late on this; I was occupied with some other priority work. I think your analysis is spot on.

To add to why we hit this race: when glusterd restarts, glusterd_svcs_manager () is invoked from glusterd_compare_friend_data () and through glusterd_restart_bricks (). In the latter case the function is invoked through a synctask, which means multiple threads can end up executing glusterd_restart_bricks () at the same time. With that parallelism it is quite possible that one thread tries to kill the process and another immediately tries to bring it up, and since it is the glusterfsd process itself that cleans up the pidfile, this race is exposed.
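To make that interleaving concrete, here is a minimal, self-contained C sketch of the pattern (illustrative only, not the actual glusterd code; the pidfile path, the helper names read_pidfile/svc_stop/svc_start, and the simplified "pidfile exists means running" check are assumptions made for the example):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative path only; the real pidfiles live under /var/run/gluster/. */
#define SHD_PIDFILE "/var/run/gluster/glustershd-example.pid"

/* Read the pid recorded in the pidfile; -1 if the file is absent/unreadable. */
static pid_t read_pidfile(const char *path)
{
    FILE *fp = fopen(path, "r");
    long pid = -1;

    if (!fp)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = -1;
    fclose(fp);
    return (pid_t)pid;
}

/* Stop: signal the daemon but do NOT unlink the pidfile -- the daemon is
 * expected to remove it itself during its own cleanup_and_exit().  A SIGKILL
 * gives it no chance to do so, leaving a stale pidfile behind. */
static void svc_stop(int sig)
{
    pid_t pid = read_pidfile(SHD_PIDFILE);

    if (pid > 0)
        kill(pid, sig);
}

/* Start: "already running" is decided purely from the pidfile, so a stale
 * file turns the start into a silent no-op. */
static void svc_start(void)
{
    if (access(SHD_PIDFILE, F_OK) == 0) {
        printf("found svc glustershd is already running\n");
        return;
    }
    /* ... fork/exec glustershd and write a fresh pidfile here ... */
}

int main(void)
{
    /* Two unsynchronized manager threads can interleave roughly as:
     *   thread A: svc_stop(SIGTERM); svc_stop(SIGKILL);   <- pidfile survives
     *   thread B: svc_start();                            <- sees stale file
     * and no glustershd process is left running. */
    svc_stop(SIGTERM);
    svc_stop(SIGKILL);
    svc_start();
    return 0;
}

The key point is that stopping never unlinks the pidfile, so any start that runs concurrently with (or right after) a SIGKILL can be fooled by the stale file.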

The good news is that I have a patch (mainly for one of the brick multiplexing bugs), https://review.gluster.org/#/c/19357, which actually addresses this problem. I would appreciate it if you could take a look at it, apply the change, and retest.

On Wed, Jan 24, 2018 at 8:39 AM, Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou@xxxxxxxxxxxxxxx> wrote:
Hi,
        While running GlusterFS tests recently, we occasionally hit the following issue: sometimes after rebooting an sn node, the glustershd process does not come up.
 
Status of volume: export
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sn-0.local:/mnt/bricks/export/brick   49154     0          Y       1157
Brick sn-1.local:/mnt/bricks/export/brick   49152     0          Y       1123
Self-heal Daemon on localhost               N/A       N/A        Y       1642
Self-heal Daemon on sn-2.local              N/A       N/A        N       N/A 
Self-heal Daemon on sn-1.local              N/A       N/A        Y       1877
 
Task Status of Volume export
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@sn-0:/root]
 
   When I check glusterd.log on sn-2, I find the following:
[2018-01-23 10:11:29.534588] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.542764] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sn-1.local (0), ret: 0, op_ret: 0
[2018-01-23 10:11:29.561677] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.561719] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-01-23 10:11:29.570655] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.570686] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-01-23 10:11:29.579589] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad, host: sn-1.local, port: 0
[2018-01-23 10:11:29.588190] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.588258] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-01-23 10:11:29.588280] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-01-23 10:11:29.588291] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-01-23 10:11:30.091487] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 1101
[2018-01-23 10:11:30.092964] W [socket.c:593:__socket_rwv] 0-glustershd: readv on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket failed (No data available)
[2018-01-23 10:11:31.091657] I [MSGID: 106568] [glusterd-proc-mgmt.c:110:glusterd_proc_stop] 0-management: second time kill 1101 //this log is added by me
[2018-01-23 10:11:31.091711] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-01-23 10:11:31.097696] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:31.591881] I [MSGID: 106567] [glusterd-svc-mgmt.c:162:glusterd_svc_start] 0-management: found svc glustershd is already running
[2018-01-23 10:11:35.592332] W [socket.c:3314:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket, (No such file or directory)
[2018-01-23 10:11:35.592757] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-01-23 10:11:35.592818] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-01-23 10:11:35.593163] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-01-23 10:11:35.593189] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped
 
From glustershd.log
[2018-01-23 10:11:29.535459] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mstate-client-1: disconnected from mstate-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-01-23 10:11:29.535463] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-services-client-1: disconnected from services-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-01-23 10:11:30.091763] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x74a5) [0x7f318d7364a5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdc) [0x40a513] -->/usr/sbin/glusterfs(cleanup_and_exit+0x77) [0x4089ec] ) 0-: received signum (15), shutting down
 
So the story seems to go like this:
 
  1. glusterd stops glustershd by sending SIGTERM.
  2. this does not work; at least from the glusterd side, it still finds the glustershd pid file, so it thinks glustershd is still alive and tries again by sending SIGKILL.
  3. glusterd tries to start glustershd; however, it finds that the glustershd pid file still exists, so it thinks there is no need to start the glustershd process again.
  4. we end up with no glustershd process.
 
So should we also remove the pid file when sending SIGKILL? Or should we remove the pid file in glusterd_svc_stop when glusterd_proc_stop returns 0? A rough sketch of the latter option is below.
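A hedged sketch of that second option (illustrative only, not the real glusterd_svc_stop/glusterd_proc_stop code; the helper name svc_stop_and_clean and the error handling are assumptions):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: stop the daemon recorded in 'pidfile' and, once the
 * stop has succeeded, unlink the pidfile ourselves instead of relying on a
 * (possibly SIGKILLed) daemon to have removed it. */
int svc_stop_and_clean(const char *pidfile, int sig)
{
    FILE *fp = fopen(pidfile, "r");
    long pid = -1;

    if (fp) {
        if (fscanf(fp, "%ld", &pid) != 1)
            pid = -1;
        fclose(fp);
    }

    if (pid > 0 && kill((pid_t)pid, sig) != 0 && errno != ESRCH)
        return -1;                  /* could not stop it: keep the pidfile */

    if (unlink(pidfile) != 0 && errno != ENOENT)
        return -1;                  /* daemon stopped, but cleanup failed */

    return 0;                       /* daemon gone and stale pidfile removed */
}

With something like this, a subsequent start would no longer be fooled by a pidfile left behind after SIGKILL.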
If I am not making this clear, you are welcome to discuss this issue with me; looking forward to your reply!
Best regards,
Cynthia (周琳)
MBB SM HETRAN SW3 MATRIX 
Storage        
Mobile: +86 (0)18657188311
 
 
 

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel
