(My last mail seems to have been dropped by the list, reposting in plain-text format)

Hello,

I recently did a rolling upgrade of some of my 3.7.6 gluster (AFR + DHT) deployments to 3.10.6 and have run into two problems since:

1) occasional IO errors
2) "gluster volume heal vol0 info" hangs after a decent amount of files has been written to the cluster (anywhere from hours to days)

All 7 of my upgraded 3.10 clusters reproduce these problems. I then tried upgrading them to 3.12.11, but nothing improved. I have filed a bug at https://bugzilla.redhat.com/show_bug.cgi?id=1618932 , but so far no one seems to have looked into it.

The hanging heal is the most annoying problem, because it makes it impossible for me to monitor my clusters. So I tried to look into the code and analyze the DEBUG & TRACE logs, and found the following:

1) I have "network.ping-timeout: 120" in the volume options. If I set network.ping-timeout to 0, "gluster volume heal vol0 info" does finish eventually, after 10~20 minutes (on an 8x3 brick setup), and reports no files to heal; it takes even longer on a larger cluster (36x3 bricks). It makes no difference whether I stop all my IO or restart every glusterd & glusterfsd process: heal info still hangs. "gluster volume status" is completely normal, and all my brick servers are on a low-latency LAN.

2) One of the clusters with this problem is a testing setup, so I can do some debugging there. I looked into heal/src/glfs-heal.c, compiled the code and added some logging: the initialization call "ret = glfs_init (fs)" is what hangs. I traced it to "ret = glfs_chdir (fs, "/");" and then into "dht_lookup( path='/' )", which never reaches "dht_readdirp_cbk()".
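For reference, the initialization path I am exercising can be approximated with a small standalone libgfapi program like the sketch below. This is not the actual glfsheal code; the volume name "vol0", the server name, the log path, the header location and the build command are just placeholders from my setup:

/* Minimal sketch of the call sequence that hangs for me:
 * glfs_init () and then glfs_chdir ("/").
 * Build with something like:
 *   gcc glfs-init-test.c -o glfs-init-test -lgfapi
 * (header/library paths may differ per distro)
 */
#include <stdio.h>
#include <stdlib.h>
#include <glusterfs/api/glfs.h>

int
main (void)
{
        glfs_t *fs  = NULL;
        int     ret = -1;

        fs = glfs_new ("vol0");          /* placeholder volume name */
        if (!fs) {
                fprintf (stderr, "glfs_new failed\n");
                return EXIT_FAILURE;
        }

        /* fetch the volfile from one of the servers (placeholder hostname) */
        ret = glfs_set_volfile_server (fs, "tcp", "storage0-cluster0", 24007);
        if (ret)
                goto out;

        ret = glfs_set_logging (fs, "/tmp/glfs-init-test.log", 9); /* 9 == TRACE */
        if (ret)
                goto out;

        fprintf (stderr, "calling glfs_init ...\n");
        ret = glfs_init (fs);            /* this is where it hangs for me */
        fprintf (stderr, "glfs_init returned %d\n", ret);
        if (ret)
                goto out;

        fprintf (stderr, "calling glfs_chdir (\"/\") ...\n");
        ret = glfs_chdir (fs, "/");      /* this ends up in dht_lookup (path='/') */
        fprintf (stderr, "glfs_chdir returned %d\n", ret);

out:
        glfs_fini (fs);
        return ret ? EXIT_FAILURE : EXIT_SUCCESS;
}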
I stopped all client IO on the gluster volume and then ran "glfsheal vol0". Looking at /var/log/glusterfs/glfsheal-vol0.log, there is massive looping of reconnect attempts on all bricks, e.g.:

[2018-09-14 05:56:09.395869] T [rpc-clnt.c:407:rpc_clnt_reconnect] 0-vol0-client-12: attempting reconnect
[2018-09-14 05:56:09.396160] T [socket.c:3125:socket_connect] (--> /lib/libglusterfs.so.0(_gf_log_callingfn+0x1ee)[0x7f10d4e79778] (--> /usr/lib/glusterfs//rpc-transport/socket.so(+0xd442)[0x7f10c647c442] (--> /lib/libgfrpc.so.0(rpc_transport_connect+0x93)[0x7f10d437de43] (--> /lib/libgfrpc.so.0(rpc_clnt_reconnect+0x136)[0x7f10d438121b] (--> /lib/libglusterfs.so.0(+0x3e4e9)[0x7f10d4e894e9] ))))) 0-vol0-client-12: connect () called on transport already connected
[2018-09-14 05:56:09.396194] T [rpc-clnt.c:407:rpc_clnt_reconnect] 0-vol0-client-10: attempting reconnect
[2018-09-14 05:56:09.396424] T [socket.c:3125:socket_connect] (--> /lib/libglusterfs.so.0(_gf_log_callingfn+0x1ee)[0x7f10d4e79778] (--> /usr/lib/glusterfs//rpc-transport/socket.so(+0xd442)[0x7f10c647c442] (--> /lib/libgfrpc.so.0(rpc_transport_connect+0x93)[0x7f10d437de43] (--> /lib/libgfrpc.so.0(rpc_clnt_reconnect+0x136)[0x7f10d438121b] (--> /lib/libglusterfs.so.0(+0x3e4e9)[0x7f10d4e894e9] ))))) 0-vol0-client-10: connect () called on transport already connected

It is also weird that 0-vol0-client-5 shows a huge RPC ping latency. Almost 30 seconds cannot be real network latency, because a simple ICMP ping to that server takes less than 0.1 seconds:

root@storage0-cluster0:~# grep ping /var/log/glusterfs/glfsheal-vol0.log | grep laten
[2018-09-14 05:13:16.008866] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-0: Ping latency is 0ms
[2018-09-14 05:13:16.016669] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-1: Ping latency is 0ms
[2018-09-14 05:13:16.021769] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-2: Ping latency is 0ms
[2018-09-14 05:13:16.033761] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-3: Ping latency is 4ms
[2018-09-14 05:13:16.101962] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-4: Ping latency is 62ms
[2018-09-14 05:13:46.105502] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-5: Ping latency is 29996ms
[2018-09-14 05:20:13.895285] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-0: Ping latency is 0ms
[2018-09-14 05:20:13.901476] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-1: Ping latency is 0ms
[2018-09-14 05:20:13.908373] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-2: Ping latency is 0ms
[2018-09-14 05:20:13.918782] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-3: Ping latency is 4ms
[2018-09-14 05:20:13.987162] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-4: Ping latency is 62ms
[2018-09-14 05:20:43.992528] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-5: Ping latency is 29998ms
[2018-09-14 05:25:11.983259] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-0: Ping latency is 0ms
[2018-09-14 05:25:11.991199] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-1: Ping latency is 0ms
[2018-09-14 05:25:11.996933] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-2: Ping latency is 0ms
[2018-09-14 05:25:12.007867] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-3: Ping latency is 4ms
[2018-09-14 05:25:12.076057] D [rpc-clnt-ping.c:211:rpc_clnt_ping_cbk] 0-vol0-client-4: Ping latency is 62ms

I checked vol0-client-5 itself: nothing special compared to the other bricks, CPU is not high and the disk has no problems.
There are no warnings or errors in /var/log/gluster/brick/data-brk01-src.log either. After turning on TRACE-level logging on the brick, the output is far too much to analyze, and shows no apparent difference from the other bricks.

I hope someone has an idea about this, or can point me at a direction to look into.

Thanks very much