Re: gluster 3.7.3 - volume heal info hangs - unknown heal status

Anuradha Talur <atalur@xxxxxxxxxx> · Thu, 24 Sep 2015 09:18:00 -0400 (EDT)

----- Original Message -----
> From: "Andreas Mather" <andreas@xxxxxxxxxxxxxxx>
> To: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> Sent: Thursday, September 24, 2015 1:24:12 PM
> Subject: gluster 3.7.3 - volume heal info hangs - unknown heal status
> 
> Hi!
> 
> Our provider had network maintenance last night, so 2 of our 4 servers got
> disconnected and reconnected. Since we knew this was coming, we shifted all
> workload off the affected servers. This morning, most of the cluster seems
> fine, but for one volume, no heal info can be retrieved, so we basically
> don't know about the healing state of the volume. The volume is a replica 2
> volume between vhost4-int/brick1 and vhost3-int/brick2.
> 
> The volume is accessible, but since I don't get any heal info, I don't know
> if it is properly replicated. Any help resolving this situation is highly
> appreciated.
> 
> hangs forever:
> [root@vhost4 ~]# gluster volume heal vol4 info
> 
> glfsheal-vol4.log:
> [2015-09-24 07:47:59.284723] I [MSGID: 101190]
> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
> index 1
> [2015-09-24 07:47:59.293735] I [MSGID: 101190]
> [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with
> index 2
> [2015-09-24 07:47:59.294061] I [MSGID: 104045] [glfs-master.c:95:notify]
> 0-gfapi: New graph 76686f73-7434-2e61-6c6c-61626f757461 (0) coming up
> [2015-09-24 07:47:59.294081] I [MSGID: 114020] [client.c:2118:notify]
> 0-vol4-client-1: parent translators are ready, attempting connect on
> transport
> [2015-09-24 07:47:59.309470] I [MSGID: 114020] [client.c:2118:notify]
> 0-vol4-client-2: parent translators are ready, attempting connect on
> transport
> [2015-09-24 07:47:59.310525] I [rpc-clnt.c:1819:rpc_clnt_reconfig]
> 0-vol4-client-1: changing port to 49155 (from 0)
> [2015-09-24 07:47:59.315958] I [MSGID: 114057]
> [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-1:
> Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2015-09-24 07:47:59.316481] I [MSGID: 114046]
> [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-1: Connected to
> vol4-client-1, attached to remote volume '/storage/brick2/brick2'.
> [2015-09-24 07:47:59.316495] I [MSGID: 114047]
> [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-1: Server and
> Client lk-version numbers are not same, reopening the fds
> [2015-09-24 07:47:59.316538] I [MSGID: 108005] [afr-common.c:3960:afr_notify]
> 0-vol4-replicate-0: Subvolume 'vol4-client-1' came back up; going online.
> [2015-09-24 07:47:59.317150] I [MSGID: 114035]
> [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-1: Server
> lk version = 1
> [2015-09-24 07:47:59.320898] I [rpc-clnt.c:1819:rpc_clnt_reconfig]
> 0-vol4-client-2: changing port to 49154 (from 0)
> [2015-09-24 07:47:59.325633] I [MSGID: 114057]
> [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-2:
> Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2015-09-24 07:47:59.325780] I [MSGID: 114046]
> [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-2: Connected to
> vol4-client-2, attached to remote volume '/storage/brick1/brick1'.
> [2015-09-24 07:47:59.325791] I [MSGID: 114047]
> [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-2: Server and
> Client lk-version numbers are not same, reopening the fds
> [2015-09-24 07:47:59.333346] I [MSGID: 114035]
> [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-2: Server
> lk version = 1
> [2015-09-24 07:47:59.334545] I [MSGID: 108031]
> [afr-common.c:1745:afr_local_discovery_cbk] 0-vol4-replicate-0: selecting
> local read_child vol4-client-2
> [2015-09-24 07:47:59.335833] I [MSGID: 104041]
> [glfs-resolve.c:862:__glfs_active_subvol] 0-vol4: switched to graph
> 76686f73-7434-2e61-6c6c-61626f757461 (0)
> 
> Questions about this output:
> -) Why does it report "Using Program GlusterFS 3.3, Num (1298437), Version
> (330)"? We run 3.7.3.
> -) gluster logs timestamps in UTC, not taking the server timezone into
> account. Is there a way to change this?
> 
> etc-glusterfs-glusterd.vol.log:
> no log entries after the volume heal info command
> 
> storage-brick1-brick1.log:
> [2015-09-24 07:47:59.325720] I [login.c:81:gf_auth] 0-auth/login: allowed
> user names: 67ef1559-d3a1-403a-b8e7-fb145c3acf4e
> [2015-09-24 07:47:59.325743] I [MSGID: 115029]
> [server-handshake.c:610:server_setvolume] 0-vol4-server: accepted client
> from
> vhost4.allaboutapps.at-14900-2015/09/24-07:47:59:282313-vol4-client-2-0-0
> (version: 3.7.3)
> 
> storage-brick2-brick2.log:
> no log entries after the volume heal info command
> 
> 
Hi Andreas,

Could you please provide the following information so that we can understand why the command is hanging?
While the command is hung, run the following command from one of the servers:
`gluster volume statedump <volname>`
This generates statedumps of the glusterfsd processes on the servers. You can find them in /var/run/gluster. A typical statedump for a brick is named "<brick-path>.<pid-of-brick>.dump.<timestamp>". Could you please attach them to your reply?
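
The steps above could be scripted roughly as follows. This is only a sketch: the helper names `list_statedumps` and `collect_statedumps`, the `DUMP_DIR` variable, and the short `sleep` are assumptions for illustration, not part of gluster. It relies only on the `gluster volume statedump` command and the /var/run/gluster location mentioned above.

```shell
#!/bin/sh
# Directory where statedumps are written (per the reply above).
# DUMP_DIR is a hypothetical override, handy if your setup differs.
DUMP_DIR="${DUMP_DIR:-/var/run/gluster}"

# List statedump files in the given directory, newest first.
# Matches the "<brick-path>.<pid-of-brick>.dump.<timestamp>" naming.
list_statedumps() {
    ls -1t "$1"/*.dump.* 2>/dev/null
}

# Trigger statedumps for a volume, then list what is in DUMP_DIR.
collect_statedumps() {
    volname="$1"
    gluster volume statedump "$volname" || return 1
    # Assumption: give the brick processes a moment to finish writing.
    sleep 2
    list_statedumps "$DUMP_DIR"
}
```

For the volume in this thread, one would call `collect_statedumps vol4` on each server while `gluster volume heal vol4 info` is still hung, and attach the files it lists.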

> Thanks,
> 
> - Andreas
> 
> 
> 
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users

-- 
Thanks,
Anuradha.