discussion about mount slow issue when stop sn-0

"Zhou, Cynthia (Nokia - CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx> · Mon, 21 Nov 2016 03:19:04 +0000

Hi Gluster experts,

Good day!
I am checking one issue related to Glusterfs, and found following issue.

Description of issue:
1>our cluster env has three storage node , sn-0(169.254.0.24)and sn-1(169.254.0.23) is deployed as replicate storage nodes. Sn-2 is quorum node.
2>stop sn-0 node(so sn-0 become unreachable suddenly), and execute mound command: time mount -t glusterfs 169.254.0.24:/log /mnt/test2 -o rw,backupvolfile-server=169.254.0.23
3> this command will not finish until 5 minutes pass.
4> wait some time (10 minutes) , execute this mount command again, the same command will finish within several seconds.

My analysis:
On the second time the mount command could return quickly, while the first time mount hang 5 minutes, I compare the glusterfs log file, and find 
1>on the first time the rpc client reconnect keeps trying until following prints:

[2016-11-17 02:50:23.593340] E [socket.c:2282:socket_connect_finish] 0-log-client-0: connection to 169.254.0.24:24007 failed (Connection timed out)
  [2016-11-17 02:50:23.593360] T [socket.c:738:__socket_disconnect] 0-log-client-0: disconnecting 0xe2a280, state=0 gen=0 sock=12

[2016-11-17 02:50:23.593370] D [socket.c:719:__socket_shutdown] 0-log-client-0: shutdown() returned -1. Transport endpoint is not connected

[2016-11-17 02:50:23.593378] D [socket.c:2359:socket_event_handler] 0-transport: disconnecting now
but the begin time of this retry is:

[2016-11-17 02:48:20.416900] T [rpc-clnt.c:418:rpc_clnt_reconnect] 0-log-client-0: attempting reconnect
Glusterfs process try a lot of times and the time elapse is around 2 minutes
2>On the second time the rpc client also keeps reconnecting but each time it returns very quickly with following prints:

[2016-11-17 04:27:06.045700] E [socket.c:2282:socket_connect_finish] 0-log-client-0: connection to 169.254.0.24:24007 failed (No route to host)

[2016-11-17 04:27:06.045751] T [socket.c:738:__socket_disconnect] 0-log-client-0: disconnecting 0xffc4a0, state=0 gen=0 sock=12

[2016-11-17 04:27:06.045760] D [socket.c:719:__socket_shutdown] 0-log-client-0: shutdown() returned -1. Transport endpoint is not connected

[2016-11-17 04:27:06.045766] D [socket.c:2359:socket_event_handler] 0-transport: disconnecting now

Questions:
      Why socket_connect_finish is called much later for the first time mount? I assume it is because in the first mount connect() timeout is slowly and the second
time connect() return much faster? 
      Is there any way to control the rpc_clnt_reconnect times? Because seems this is a waste, no need to retry so many times.

Attachment: two times mount trace log of glusterfs.

Best regards,

Cynthia （周琳）
MBB SM HETRAN SW3 MATRIX  
Storage         

Mobile: +86 (0)18657188311

<<attachment: when_nova_stop_sn-0_mount_slow_on_sn1_and_sn2_issue.zip>>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel