Re: gluster 3.7.3 - volume heal info hangs - unknown heal status

Andreas Mather <andreas@xxxxxxxxxxxxxxx> · Mon, 5 Oct 2015 14:26:35 +0200

Hi!
The workload is VMs based on KVM/qemu, managed by libvirtd. I figured I could check the heal status by comparing the bricks: existing files were not being replicated, but new files were (after a long delay of about 5 minutes). So I wanted to see whether existing files (VM images) would be healed if I stopped a VM (closing any open handle on the file), which turned out not to be the case.
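
For reference, comparing the bricks can be sketched roughly as follows. This is a minimal sketch, not what was actually run: the function names `brick_manifest` and `compare_manifests` are made up for illustration, and it assumes each brick is a plain directory whose internal `.glusterfs` directory should be skipped. Since the two bricks live on different servers, the manifest would have to be built on each server and the results brought together for comparison.

```python
import hashlib
import os


def brick_manifest(brick_root):
    """Map each regular file's brick-relative path to (size, sha256 hex digest)."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Assumption: GlusterFS keeps its internal metadata in ".glusterfs"
        # at the brick root; it is not user data, so do not descend into it.
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Only compare regular files; skip symlinks and special files.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large VM images fit in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(full, brick_root)
            manifest[rel] = (os.path.getsize(full), digest.hexdigest())
    return manifest


def compare_manifests(a, b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of paths."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(p for p in set(a) & set(b) if a[p] != b[p])
    return only_a, only_b, differing
```

Checksumming large VM images is slow, and an image that a running VM is still writing to can differ between bricks at any given moment, so a mismatch is only a hint, not proof of a pending heal.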

I ended up shutting down all VMs and restarting the server. Afterwards, healing worked as expected.

- Andreas

On Mon, Oct 5, 2015 at 1:01 PM, Anuradha Talur <atalur@xxxxxxxxxx> wrote:

----- Original Message -----

> From: "Andreas Mather" <andreas@xxxxxxxxxxxxxxx>

> To: "Anuradha Talur" <atalur@xxxxxxxxxx>

> Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>

> Sent: Thursday, September 24, 2015 6:59:38 PM

> Subject: Re:  gluster 3.7.3 - volume heal info hangs - unknown heal status

>

> Hi Anuradha!

>

> Thanks for your reply! Attached you can find the dump files. As I'm not

> sure if they make their way through as attachments, here're links to them

> as well:

>

> brick1 - http://pastebin.com/3ivkhuRH

> brick2 - http://pastebin.com/77sT1mut

Hi,

I see some blocked locks from the statedump.

Could you let me know what kind of workload you had when you observed the hang?
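
For anyone reading the statedumps linked above, blocked lock entries can be listed with something like the following. It is a minimal sketch under stated assumptions: that lock lines in the dump carry a literal `(BLOCKED)` marker, that sections start with a `[...]` header line, and that the affected file is listed on a `path=` line within the section. The exact layout may differ between GlusterFS versions, and the function name `blocked_locks` is made up for illustration.

```python
import re

# Assumed statedump layout: a "[section.name]" header line opens each section.
SECTION_RE = re.compile(r"^\[(.+)\]$")


def blocked_locks(dump_text):
    """Return (section, path, lock_line) for every lock line marked BLOCKED."""
    results = []
    section = None
    path = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            # A new section starts; forget the previous section's path.
            section = header.group(1)
            path = None
            continue
        if line.startswith("path="):
            path = line[len("path="):]
        elif "(BLOCKED)" in line:
            results.append((section, path, line))
    return results
```

Reading a dump file into a string and passing it to `blocked_locks` gives one tuple per blocked lock, which shows which files the waiting locks belong to.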

>

> - Andreas

>

> On Thu, Sep 24, 2015 at 3:18 PM, Anuradha Talur <atalur@xxxxxxxxxx> wrote:

>

> >

> >

> > ----- Original Message -----

> > > From: "Andreas Mather" <andreas@xxxxxxxxxxxxxxx>

> > > To: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>

> > > Sent: Thursday, September 24, 2015 1:24:12 PM

> > > Subject: gluster 3.7.3 - volume heal info hangs - unknown heal status

> > >

> > > Hi!

> > >

> > > Our provider had network maintenance last night, so 2 of our 4 servers got
> > > disconnected and reconnected. Since we knew this was coming, we shifted all
> > > workload off the affected servers. This morning, most of the cluster seems
> > > fine, but for one volume no heal info can be retrieved, so we basically
> > > don't know the healing state of the volume. The volume is a replica 2
> > > volume between vhost4-int/brick1 and vhost3-int/brick2.

> > >

> > > The volume is accessible, but since I don't get any heal info, I don't know
> > > if it is properly replicated. Any help resolving this situation is highly
> > > appreciated.

> > >

> > > hangs forever:

> > > [root@vhost4 ~]# gluster volume heal vol4 info

> > >

> > > glfsheal-vol4.log:

> > > [2015-09-24 07:47:59.284723] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> > > [2015-09-24 07:47:59.293735] I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2
> > > [2015-09-24 07:47:59.294061] I [MSGID: 104045] [glfs-master.c:95:notify] 0-gfapi: New graph 76686f73-7434-2e61-6c6c-61626f757461 (0) coming up
> > > [2015-09-24 07:47:59.294081] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-1: parent translators are ready, attempting connect on transport
> > > [2015-09-24 07:47:59.309470] I [MSGID: 114020] [client.c:2118:notify] 0-vol4-client-2: parent translators are ready, attempting connect on transport
> > > [2015-09-24 07:47:59.310525] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-1: changing port to 49155 (from 0)
> > > [2015-09-24 07:47:59.315958] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > [2015-09-24 07:47:59.316481] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-1: Connected to vol4-client-1, attached to remote volume '/storage/brick2/brick2'.
> > > [2015-09-24 07:47:59.316495] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-1: Server and Client lk-version numbers are not same, reopening the fds
> > > [2015-09-24 07:47:59.316538] I [MSGID: 108005] [afr-common.c:3960:afr_notify] 0-vol4-replicate-0: Subvolume 'vol4-client-1' came back up; going online.
> > > [2015-09-24 07:47:59.317150] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-1: Server lk version = 1
> > > [2015-09-24 07:47:59.320898] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol4-client-2: changing port to 49154 (from 0)
> > > [2015-09-24 07:47:59.325633] I [MSGID: 114057] [client-handshake.c:1437:select_server_supported_programs] 0-vol4-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> > > [2015-09-24 07:47:59.325780] I [MSGID: 114046] [client-handshake.c:1213:client_setvolume_cbk] 0-vol4-client-2: Connected to vol4-client-2, attached to remote volume '/storage/brick1/brick1'.
> > > [2015-09-24 07:47:59.325791] I [MSGID: 114047] [client-handshake.c:1224:client_setvolume_cbk] 0-vol4-client-2: Server and Client lk-version numbers are not same, reopening the fds
> > > [2015-09-24 07:47:59.333346] I [MSGID: 114035] [client-handshake.c:193:client_set_lk_version_cbk] 0-vol4-client-2: Server lk version = 1
> > > [2015-09-24 07:47:59.334545] I [MSGID: 108031] [afr-common.c:1745:afr_local_discovery_cbk] 0-vol4-replicate-0: selecting local read_child vol4-client-2
> > > [2015-09-24 07:47:59.335833] I [MSGID: 104041] [glfs-resolve.c:862:__glfs_active_subvol] 0-vol4: switched to graph 76686f73-7434-2e61-6c6c-61626f757461 (0)

> > >

> > > Questions about this output:
> > > -) Why does it report "Using Program GlusterFS 3.3, Num (1298437), Version (330)"? We run 3.7.3?!
> > > -) gluster logs timestamps in UTC, not taking the server timezone into account. Is there a way to fix this?

> > >

> > > etc-glusterfs-glusterd.vol.log:

> > > no log entries after the volume heal info command

> > >

> > > storage-brick1-brick1.log:

> > > [2015-09-24 07:47:59.325720] I [login.c:81:gf_auth] 0-auth/login: allowed user names: 67ef1559-d3a1-403a-b8e7-fb145c3acf4e
> > > [2015-09-24 07:47:59.325743] I [MSGID: 115029] [server-handshake.c:610:server_setvolume] 0-vol4-server: accepted client from vhost4.allaboutapps.at-14900-2015/09/24-07:47:59:282313-vol4-client-2-0-0 (version: 3.7.3)

> > >

> > > storage-brick2-brick2.log:

> > > no log entries after the volume heal info command

> > >

> > >

> > Hi Andreas,

> >

> > Could you please provide the following information so that we can
> > understand why the command is hanging?
> > When the command is hung, run the following command from one of the
> > servers:
> > `gluster volume statedump <volname>`
> > This command will generate statedumps of the glusterfsd processes on the
> > servers. You can find them in /var/run/gluster. A typical statedump for a
> > brick is named "<brick-path>.<pid-of-brick>.dump.<timestamp>". Could
> > you please attach them and respond?

> >

> > > Thanks,

> > >

> > > - Andreas

> > >

> > >

> > >

> > > _______________________________________________

> > > Gluster-users mailing list

> > > Gluster-users@xxxxxxxxxxx

> > > http://www.gluster.org/mailman/listinfo/gluster-users

> >

> > --

> > Thanks,

> > Anuradha.

> >

>

--

Thanks,

Anuradha.
