Hi!
The VMs are KVM/qemu-based and managed by libvirtd. I figured I could check the heal status by comparing the bricks: existing files were not being replicated, but new files were (after a long delay of about 5 minutes). So I wanted to see whether existing files (VM images) would be healed if I stopped a VM (closing any open handle on the file), which turned out not to be the case.
I ended up shutting down all VMs and restarting the server. Afterwards, healing worked as expected.
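For reference, a minimal sketch of the brick comparison described above. It assumes each brick directory can be read locally on its server and that the `.glusterfs` directory at the brick root holds GlusterFS-internal metadata and can be skipped; the function names `manifest` and `diff_manifests` are illustrative only. Checksums are only meaningful for files that are not being written, so images of running VMs may differ transiently.

```python
import hashlib
import os


def manifest(brick_root, skip_dirs=(".glusterfs",)):
    """Map each file path (relative to brick_root) to (size, sha256 hex digest)."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Prune internal metadata directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large VM images do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(full, brick_root)
            result[rel] = (os.path.getsize(full), digest.hexdigest())
    return result


def diff_manifests(a, b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of relative paths."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differing
```

One way to use it would be to run `manifest` on each server against its own brick (for example /storage/brick1/brick1 and /storage/brick2/brick2), copy one result over, and pass both to `diff_manifests`.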
- Andreas
On Mon, Oct 5, 2015 at 1:01 PM, Anuradha Talur <atalur@xxxxxxxxxx> wrote:
----- Original Message -----
> From: "Andreas Mather" <andreas@xxxxxxxxxxxxxxx>
> To: "Anuradha Talur" <atalur@xxxxxxxxxx>
> Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> Sent: Thursday, September 24, 2015 6:59:38 PM
> Subject: Re: gluster 3.7.3 - volume heal info hangs - unknown heal status
>
> Hi Anuradha!
>
> Thanks for your reply! Attached you can find the dump files. As I'm not
> sure if they make their way through as attachments, here're links to them
> as well:
>
> brick1 - http://pastebin.com/3ivkhuRH
> brick2 - http://pastebin.com/77sT1mut
Hi,
I see some blocked locks from the statedump.
Could you let me know what kind of workload you had when you observed the hang?
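To locate such entries quickly, here is a minimal sketch that scans the text of a statedump for blocked locks. It assumes that lock entries carry an `(ACTIVE)` or `(BLOCKED)` marker, that sections start with a `[` header line, and that a `path=` line within a section names the affected file; the function name `blocked_locks` is illustrative and not part of any gluster tooling.

```python
def blocked_locks(dump_text):
    """Return (path, lock_line) pairs for every lock entry marked BLOCKED."""
    current_path = None
    found = []
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new section begins; forget the path from the previous one.
            current_path = None
        elif line.startswith("path="):
            current_path = line[len("path="):]
        elif "(BLOCKED)" in line:
            found.append((current_path, line))
    return found
```

The function takes the dump contents as a string, so it can be fed from a file read on the server or from text copied out of a pastebin.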
> - Andreas
>
>
>
>
> On Thu, Sep 24, 2015 at 3:18 PM, Anuradha Talur <atalur@xxxxxxxxxx> wrote:
>
> >
> >
> > ----- Original Message -----
> > > From: "Andreas Mather" <andreas@xxxxxxxxxxxxxxx>
> > > To: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > Sent: Thursday, September 24, 2015 1:24:12 PM
> > > Subject: gluster 3.7.3 - volume heal info hangs - unknown heal status
> > >
> > > Hi!
> > >
> > > Our provider had network maintenance last night, so 2 of our 4 servers
> > > got disconnected and reconnected. Since we knew this was coming, we
> > > shifted all workload off the affected servers. This morning, most of the
> > > cluster seems fine, but for one volume, no heal info can be retrieved, so
> > > we basically don't know the healing state of the volume. The volume is a
> > > replica 2 volume between vhost4-int/brick1 and vhost3-int/brick2.
> > >
> > > The volume is accessible, but since I don't get any heal info, I don't
> > > know if it is properly replicated. Any help to resolve this situation is
> > > highly appreciated.
> > >
> > > hangs forever:
> > > [root@vhost4 ~]# gluster volume heal vol4 info
> > >
> > > glfsheal-vol4.log:
> > > [2015-09-24 07:47:59.284723] I [MSGID: 101190]
> > > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
> > with
> > > index 1
> > > [2015-09-24 07:47:59.293735] I [MSGID: 101190]
> > > [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread
> > with
> > > index 2
> > > [2015-09-24 07:47:59.294061] I [MSGID: 104045] [glfs-master.c:95:notify]
> > > 0-gfapi: New graph 76686f73-7434-2e61-6c6c-61626f757461 (0) coming up
> > > [2015-09-24 07:47:59.294081] I [MSGID: 114020] [client.c:2118:notify]
> > > 0-vol4-client-1: parent translators are ready, attempting connect on
> > > transport
> > > [2015-09-24 07:47:59.309470] I [MSGID: 114020] [client.c:2118:notify]
> > > 0-vol4-client-2: parent translators are ready, attempting connect on
> > > transport
> > > [2015-09-24 07:47:59.310525] I [rpc-clnt.c:1819:rpc_clnt_reconfig]
> > > 0-vol4-client-1: changing port to 49155 (from 0)
> > > [2015-09-24 07:47:59.315958] I [MSGID: 114057]
> > > [client-handshake.c:1437:select_server_supported_programs]
> > 0-vol4-client-1:
> > > Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > [2015-09-24 07:47:59.316481] I [MSGID: 114046]
> > > [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-1:
> > Connected to
> > > vol4-client-1, attached to remote volume '/storage/brick2/brick2'.
> > > [2015-09-24 07:47:59.316495] I [MSGID: 114047]
> > > [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-1: Server
> > and
> > > Client lk-version numbers are not same, reopening the fds
> > > [2015-09-24 07:47:59.316538] I [MSGID: 108005]
> > [afr-common.c:3960:afr_notify]
> > > 0-vol4-replicate-0: Subvolume 'vol4-client-1' came back up; going online.
> > > [2015-09-24 07:47:59.317150] I [MSGID: 114035]
> > > [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-1:
> > Server
> > > lk version = 1
> > > [2015-09-24 07:47:59.320898] I [rpc-clnt.c:1819:rpc_clnt_reconfig]
> > > 0-vol4-client-2: changing port to 49154 (from 0)
> > > [2015-09-24 07:47:59.325633] I [MSGID: 114057]
> > > [client-handshake.c:1437:select_server_supported_programs]
> > 0-vol4-client-2:
> > > Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > [2015-09-24 07:47:59.325780] I [MSGID: 114046]
> > > [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-2:
> > Connected to
> > > vol4-client-2, attached to remote volume '/storage/brick1/brick1'.
> > > [2015-09-24 07:47:59.325791] I [MSGID: 114047]
> > > [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-2: Server
> > and
> > > Client lk-version numbers are not same, reopening the fds
> > > [2015-09-24 07:47:59.333346] I [MSGID: 114035]
> > > [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-2:
> > Server
> > > lk version = 1
> > > [2015-09-24 07:47:59.334545] I [MSGID: 108031]
> > > [afr-common.c:1745:afr_local_discovery_cbk] 0-vol4-replicate-0: selecting
> > > local read_child vol4-client-2
> > > [2015-09-24 07:47:59.335833] I [MSGID: 104041]
> > > [glfs-resolve.c:862:__glfs_active_subvol] 0-vol4: switched to graph
> > > 76686f73-7434-2e61-6c6c-61626f757461 (0)
> > >
> > > Questions about this output:
> > > -) Why does it report "Using Program GlusterFS 3.3, Num (1298437),
> > > Version (330)"? We run 3.7.3.
> > > -) gluster logs timestamps in UTC, not taking the server timezone into
> > > account. Is there a way to change this?
> > >
> > > etc-glusterfs-glusterd.vol.log:
> > > no log entries after the volume heal info command
> > >
> > > storage-brick1-brick1.log:
> > > [2015-09-24 07:47:59.325720] I [login.c:81:gf_auth] 0-auth/login: allowed
> > > user names: 67ef1559-d3a1-403a-b8e7-fb145c3acf4e
> > > [2015-09-24 07:47:59.325743] I [MSGID: 115029]
> > > [server-handshake.c:610:server_setvolume] 0-vol4-server: accepted client
> > > from
> > > vhost4.allaboutapps.at-14900-2015/09/24-07:47:59:282313-vol4-client-2-0-0
> > > (version: 3.7.3)
> > >
> > > storage-brick2-brick2.log:
> > > no log entries after the volume heal info command
> > >
> > >
> > Hi Andreas,
> >
> > Could you please provide the following information so that we can
> > understand why the command is hanging?
> > When the command is hung, run the following command from one of the
> > servers:
> > `gluster volume statedump <volname>`
> > This command will generate statedumps of glusterfsd processes in the
> > servers. You can find them at /var/run/gluster . A typical statedump for a
> > brick has "<brick-path>.<pid-of-brick>.dump.<timestamp>" as its name. Could
> > you please attach them and respond?
> >
> > > Thanks,
> > >
> > > - Andreas
> > >
> > >
> > >
> > > _______________________________________________
> > > Gluster-users mailing list
> > > Gluster-users@xxxxxxxxxxx
> > > http://www.gluster.org/mailman/listinfo/gluster-users
> >
> > --
> > Thanks,
> > Anuradha.
> >
>
Thanks,
Anuradha.