Apologies for coming back late on this; I was occupied with some other high-priority work. I think your analysis is spot on.
To add to why we hit this race: when glusterd restarts, glusterd_svcs_manager() is invoked both from glusterd_compare_friend_data() and through glusterd_restart_bricks(). In the latter case the function runs in a synctask, which means multiple threads can end up executing glusterd_restart_bricks() at the same time. With that parallelism it is quite possible that one thread tries to kill the process while another immediately tries to bring it up, and since it is the glusterfsd process itself that cleans up the pidfile, this race is exposed.
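To make the interleaving concrete, here is a stripped-down sketch (illustrative only, not the actual glusterd code; pidfile_exists() and spawn_daemon() are hypothetical helpers). When the stop and start paths run on different threads and liveness is judged purely by the pidfile, this ordering leaves the daemon down while the pidfile says it is up:

/* Stripped-down sketch of the race -- illustrative only, not glusterd code. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int
pidfile_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Thread A (one concurrent caller): stop the daemon.  The daemon normally
 * unlinks its own pidfile while handling SIGTERM, but if we escalate to
 * SIGKILL before that happens, the pidfile is left behind for good. */
void
stop_path(pid_t pid, const char *pidfile)
{
    kill(pid, SIGTERM);
    sleep(1);                       /* grace period */
    if (pidfile_exists(pidfile))    /* cleanup not finished yet? */
        kill(pid, SIGKILL);         /* now nobody will ever remove the file */
}

/* Thread B (another concurrent caller): start the daemon.  Liveness is
 * judged purely by the presence of the pidfile, so the stale file left
 * behind by thread A makes it skip the start entirely. */
void
start_path(const char *pidfile)
{
    if (pidfile_exists(pidfile)) {
        printf("found svc already running, nothing to do\n");
        return;                     /* => no daemon is actually running */
    }
    /* spawn_daemon(pidfile); */
}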
The good news is that I have a patch (mainly for one of the brick multiplexing bugs), https://review.gluster.org/#/c/19357, which actually addresses this problem. I would appreciate it if you could take a look at it, apply the change, and retest.
On Wed, Jan 24, 2018 at 8:39 AM, Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou@xxxxxxxxxxxxxxx> wrote:
Hi,

When we were doing glusterfs tests recently, we occasionally met the following issue: sometimes, after rebooting the sn node, the glustershd process does not come up.

Status of volume: export
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick sn-0.local:/mnt/bricks/export/brick   49154     0          Y       1157
Brick sn-1.local:/mnt/bricks/export/brick   49152     0          Y       1123
Self-heal Daemon on localhost               N/A       N/A        Y       1642
Self-heal Daemon on sn-2.local              N/A       N/A        N       N/A
Self-heal Daemon on sn-1.local              N/A       N/A        Y       1877

Task Status of Volume export
------------------------------------------------------------------------------
There are no active volume tasks

[root@sn-0:/root]

When I check glusterd.log on sn-2 I find the following:

[2018-01-23 10:11:29.534588] I [MSGID: 106490] [glusterd-handler.c:2540:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.542764] I [MSGID: 106493] [glusterd-handler.c:3800:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to sn-1.local (0), ret: 0, op_ret: 0
[2018-01-23 10:11:29.561677] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.561719] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-01-23 10:11:29.570655] I [MSGID: 106492] [glusterd-handler.c:2718:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.570686] I [MSGID: 106502] [glusterd-handler.c:2763:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2018-01-23 10:11:29.579589] I [MSGID: 106493] [glusterd-rpc-ops.c:486:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad, host: sn-1.local, port: 0
[2018-01-23 10:11:29.588190] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:29.588258] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: nfs already stopped
[2018-01-23 10:11:29.588280] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: nfs service is stopped
[2018-01-23 10:11:29.588291] I [MSGID: 106600] [glusterd-nfs-svc.c:82:glusterd_nfssvc_manager] 0-management: nfs/server.so xlator is not installed
[2018-01-23 10:11:30.091487] I [MSGID: 106568] [glusterd-proc-mgmt.c:87:glusterd_proc_stop] 0-management: Stopping glustershd daemon running in pid: 1101
[2018-01-23 10:11:30.092964] W [socket.c:593:__socket_rwv] 0-glustershd: readv on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket failed (No data available)
[2018-01-23 10:11:31.091657] I [MSGID: 106568] [glusterd-proc-mgmt.c:110:glusterd_proc_stop] 0-management: second time kill 1101    // this log is added by me
[2018-01-23 10:11:31.091711] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: glustershd service is stopped
[2018-01-23 10:11:31.097696] I [MSGID: 106493] [glusterd-rpc-ops.c:701:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7c736522-04c6-4677-85cf-e2ab5d61baad
[2018-01-23 10:11:31.591881] I [MSGID: 106567] [glusterd-svc-mgmt.c:162:glusterd_svc_start] 0-management: found svc glustershd is already running
[2018-01-23 10:11:35.592332] W [socket.c:3314:socket_connect] 0-glustershd: Ignore failed connection attempt on /var/run/gluster/d6b222161f840fc2854cb04ae3d7d658.socket, (No such file or directory)
[2018-01-23 10:11:35.592757] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: quotad already stopped
[2018-01-23 10:11:35.592818] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: quotad service is stopped
[2018-01-23 10:11:35.593163] I [MSGID: 106132] [glusterd-proc-mgmt.c:83:glusterd_proc_stop] 0-management: bitd already stopped
[2018-01-23 10:11:35.593189] I [MSGID: 106568] [glusterd-svc-mgmt.c:238:glusterd_svc_stop] 0-management: bitd service is stopped

From glustershd.log:

[2018-01-23 10:11:29.535459] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-mstate-client-1: disconnected from mstate-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-01-23 10:11:29.535463] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-services-client-1: disconnected from services-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-01-23 10:11:30.091763] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x74a5) [0x7f318d7364a5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdc) [0x40a513] -->/usr/sbin/glusterfs(cleanup_and_exit+0x77) [0x4089ec] ) 0-: received signum (15), shutting down

So, the story seems to go like this:
- glusterd stops glustershd by sending SIGTERM
- this does not work; at least from glusterd's side, the glustershd pid file is still there, so it thinks glustershd is still alive and tries again by sending SIGKILL
- glusterd then tries to start glustershd; however, it finds the glustershd pid file still exists, so it thinks there is no need to start the glustershd process again
- we end up with no glustershd process
So should we also remove the pid file when sending SIGKILL? Or should we remove the pid file in glusterd_svc_stop when glusterd_proc_stop returns 0? If I am not making this clear, you are welcome to discuss the issue with me; looking forward to your reply!
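Something along these lines is what I have in mind for the second option (just a rough sketch to show the idea; the names and the proc_stop() signature are illustrative, not the real glusterd ones):

/* Rough sketch of the idea only -- not the actual glusterd code. */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for glusterd_proc_stop(): assume it returns 0 once the daemon
 * is confirmed dead (possibly after escalating SIGTERM to SIGKILL). */
extern int proc_stop(const char *pidfile);

int
svc_stop(const char *pidfile)
{
    int ret = proc_stop(pidfile);

    if (ret == 0) {
        /* The daemon is gone and can no longer unlink its own pidfile,
         * so remove it here; otherwise a later start sees the stale file
         * and wrongly concludes the service is still running. */
        if (unlink(pidfile) != 0 && errno != ENOENT)
            fprintf(stderr, "failed to remove stale pidfile %s\n", pidfile);
    }
    return ret;
}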
Best regards,
Cynthia (周琳)
MBB SM HETRAN SW3 MATRIX Storage
Mobile: +86 (0)18657188311