The content of the file zp2-brick2.log is at http://ur1.ca/iku0l (http://fpaste.org/145714/44849041/ ).
I can't open the file /zp2/brick2/.glusterfs/health_check, since any access to it hangs because the disk is no longer present.
Let me know the filename pattern to look for, so that I can find it.
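If it helps, I can try reading it with a bounded wait so the shell does not block, e.g. with the coreutils timeout utility (just a sketch; a process stuck in uninterruptible I/O may still not be killable, so this may not get around the hang):

# timeout 5 cat /zp2/brick2/.glusterfs/health_check
# timeout 5 stat /zp2/brick2/.glusterfs/health_check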
On Tue, Oct 28, 2014 at 1:42 PM, Niels de Vos <ndevos@xxxxxxxxxx> wrote:
On Tue, Oct 28, 2014 at 01:10:56PM +0530, Kiran Patil wrote:
> I applied the patches, then compiled and installed Gluster.
>
> # glusterfs --version
> glusterfs 3.7dev built on Oct 28 2014 12:03:10
> Repository revision: git://git.gluster.com/glusterfs.git
> Copyright (c) 2006-2013 Red Hat, Inc. <http://www.redhat.com/>
> GlusterFS comes with ABSOLUTELY NO WARRANTY.
> It is licensed to you under your choice of the GNU Lesser
> General Public License, version 3 or any later version (LGPLv3
> or later), or the GNU General Public License, version 2 (GPLv2),
> in all cases as published by the Free Software Foundation.
>
> # git log
> commit 990ce16151c3af17e4cdaa94608b737940b60e4d
> Author: Lalatendu Mohanty <lmohanty@xxxxxxxxxx>
> Date: Tue Jul 1 07:52:27 2014 -0400
>
> Posix: Brick failure detection fix for ext4 filesystem
> ...
> ...
>
> I see the messages below:
Many thanks, Kiran!
Do you have the messages from the brick that uses the zp2 mountpoint?
There should also be a file with a timestamp of when the last check was
done successfully. If the brick is still running, this timestamp should
get updated every storage.health-check-interval seconds:
/zp2/brick2/.glusterfs/health_check
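For example, two reads roughly one interval apart should show the timestamp advancing (just a sketch):

# cat /zp2/brick2/.glusterfs/health_check
# sleep 30; cat /zp2/brick2/.glusterfs/health_check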
Niels
>
> File /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:
>
> The message "I [MSGID: 106005]
> [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick
> 192.168.1.246:/zp2/brick2 has disconnected from glusterd." repeated 39
> times between [2014-10-28 05:58:09.209419] and [2014-10-28 06:00:06.226330]
> [2014-10-28 06:00:09.226507] W [socket.c:545:__socket_rwv] 0-management:
> readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid
> argument)
> [2014-10-28 06:00:09.226712] I [MSGID: 106005]
> [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick
> 192.168.1.246:/zp2/brick2 has disconnected from glusterd.
> [2014-10-28 06:00:12.226881] W [socket.c:545:__socket_rwv] 0-management:
> readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid
> argument)
> [2014-10-28 06:00:15.227249] W [socket.c:545:__socket_rwv] 0-management:
> readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid
> argument)
> [2014-10-28 06:00:18.227616] W [socket.c:545:__socket_rwv] 0-management:
> readv on /var/run/6154ed2845b7f728a3acdce9d69e08ee.socket failed (Invalid
> argument)
> [2014-10-28 06:00:21.227976] W [socket.c:545:__socket_rwv] 0-management:
> readv on
>
> .....
> .....
>
> [2014-10-28 06:19:15.142867] I
> [glusterd-handler.c:1280:__glusterd_handle_cli_get_volume] 0-glusterd:
> Received get vol req
> The message "I [MSGID: 106005]
> [glusterd-handler.c:4142:__glusterd_brick_rpc_notify] 0-management: Brick
> 192.168.1.246:/zp2/brick2 has disconnected from glusterd." repeated 12
> times between [2014-10-28 06:18:09.368752] and [2014-10-28 06:18:45.373063]
> [2014-10-28 06:23:38.207649] W [glusterfsd.c:1194:cleanup_and_exit] (-->
> 0-: received signum (15), shutting down
>
>
> dmesg output:
>
> SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
> encountered an uncorrectable I/O failure and has been suspended.
>
> SPLError: 7868:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
> encountered an uncorrectable I/O failure and has been suspended.
>
> SPLError: 7869:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'zp2' has
> encountered an uncorrectable I/O failure and has been suspended.
>
> The brick is still online.
>
> # gluster volume status
> Status of volume: repvol
> Gluster process                                 Port    Online  Pid
> ------------------------------------------------------------------------------
> Brick 192.168.1.246:/zp1/brick1                 49152   Y       4067
> Brick 192.168.1.246:/zp2/brick2                 49153   Y       4078
> NFS Server on localhost                         2049    Y       4092
> Self-heal Daemon on localhost                   N/A     Y       4097
>
> Task Status of Volume repvol
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> # gluster volume info
>
> Volume Name: repvol
> Type: Replicate
> Volume ID: ba1e7c6d-1e1c-45cd-8132-5f4fa4d2d22b
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.1.246:/zp1/brick1
> Brick2: 192.168.1.246:/zp2/brick2
> Options Reconfigured:
> storage.health-check-interval: 30
>
> Let me know if you need further information.
>
> Thanks,
> Kiran.
>
> On Tue, Oct 28, 2014 at 11:44 AM, Kiran Patil <kiran@xxxxxxxxxxxxx> wrote:
>
> > I changed "git fetch git://review.gluster.org/glusterfs" to "git fetch
> > http://review.gluster.org/glusterfs", and now it works.
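> >
> > In other words, the fetch from step 3 below, with only the transport changed:
> >
> > # git fetch http://review.gluster.org/glusterfs refs/changes/13/8213/9 && git checkout FETCH_HEAD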
> >
> > Thanks,
> > Kiran.
> >
> > On Tue, Oct 28, 2014 at 11:13 AM, Kiran Patil <kiran@xxxxxxxxxxxxx> wrote:
> >
> >> Hi Niels,
> >>
> >> I am getting "fatal: Couldn't find remote ref refs/changes/13/8213/9"
> >> error.
> >>
> >> Steps to reproduce the issue:
> >>
> >> 1) # git clone git://review.gluster.org/glusterfs
> >> Initialized empty Git repository in /root/gluster-3.6/glusterfs/.git/
> >> remote: Counting objects: 84921, done.
> >> remote: Compressing objects: 100% (48307/48307), done.
> >> remote: Total 84921 (delta 57264), reused 63233 (delta 36254)
> >> Receiving objects: 100% (84921/84921), 23.23 MiB | 192 KiB/s, done.
> >> Resolving deltas: 100% (57264/57264), done.
> >>
> >> 2) # cd glusterfs
> >> # git branch
> >> * master
> >>
> >> 3) # git fetch git://review.gluster.org/glusterfs refs/changes/13/8213/9
> >> && git checkout FETCH_HEAD
> >> fatal: Couldn't find remote ref refs/changes/13/8213/9
> >>
> >> Note: I also tried the above steps with the git repo
> >> https://github.com/gluster/glusterfs and the result is the same as above.
> >>
> >> Please let me know if I have missed any steps.
> >>
> >> Thanks,
> >> Kiran.
> >>
> >> On Mon, Oct 27, 2014 at 5:53 PM, Niels de Vos <ndevos@xxxxxxxxxx> wrote:
> >>
> >>> On Mon, Oct 27, 2014 at 05:19:13PM +0530, Kiran Patil wrote:
> >>> > Hi,
> >>> >
> >>> > I created a replicated volume with two bricks on the same node and
> >>> > copied some data to it.
> >>> > data to it.
> >>> >
> >>> > I then removed the disk hosting one of the bricks of the volume.
> >>> >
> >>> > storage.health-check-interval is set to 30 seconds.
> >>> >
> >>> > I can see that the disk is unavailable using the zpool command of ZFS
> >>> > on Linux, but gluster volume status still shows the brick process as
> >>> > running, even though it should have been shut down by now.
> >>> >
> >>> > Is this a bug in 3.6, given that it is listed as a feature
> >>> > (https://github.com/gluster/glusterfs/blob/release-3.6/doc/features/brick-failure-detection.md),
> >>> > or am I making a mistake here?
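> >>> >
> >>> > For context, the setup was roughly along these lines (volume name,
> >>> > bricks and option as in the output below; treat the exact commands as
> >>> > a sketch, and "force" may be needed since both bricks are on one node):
> >>> >
> >>> > # gluster volume create repvol replica 2 192.168.1.246:/zp1/brick1 192.168.1.246:/zp2/brick2 force
> >>> > # gluster volume start repvol
> >>> > # gluster volume set repvol storage.health-check-interval 30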
> >>>
> >>> The initial detection of brick failures did not work for all
> >>> filesystems, and it may not work for ZFS either. A fix has been posted,
> >>> but it has not been merged into the master branch yet. Once the change
> >>> has been merged, it can be backported to 3.6 and 3.5.
> >>>
> >>> You may want to test with the patch applied, and add your "+1 Verified"
> >>> to the change in case it makes it functional for you:
> >>> - http://review.gluster.org/8213
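> >>>
> >>> A minimal sketch of a test run, assuming you build from the git tree
> >>> after fetching that change (the usual autotools steps):
> >>>
> >>> # ./autogen.sh && ./configure && make && make install
> >>>
> >>> With the patch in place, pulling the disk again and waiting a little
> >>> longer than storage.health-check-interval should make the brick show up
> >>> as offline in "gluster volume status".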
> >>>
> >>> Cheers,
> >>> Niels
> >>>
> >>> >
> >>> > [root@fractal-c92e gluster-3.6]# gluster volume status
> >>> > Status of volume: repvol
> >>> > Gluster process                                 Port    Online  Pid
> >>> > ------------------------------------------------------------------------------
> >>> > Brick 192.168.1.246:/zp1/brick1                 49154   Y       17671
> >>> > Brick 192.168.1.246:/zp2/brick2                 49155   Y       17682
> >>> > NFS Server on localhost                         2049    Y       17696
> >>> > Self-heal Daemon on localhost                   N/A     Y       17701
> >>> >
> >>> > Task Status of Volume repvol
> >>> > ------------------------------------------------------------------------------
> >>> > There are no active volume tasks
> >>> >
> >>> >
> >>> > [root@fractal-c92e gluster-3.6]# gluster volume info
> >>> >
> >>> > Volume Name: repvol
> >>> > Type: Replicate
> >>> > Volume ID: d4f992b1-1393-43b8-9fda-2e2b6e3b5039
> >>> > Status: Started
> >>> > Number of Bricks: 1 x 2 = 2
> >>> > Transport-type: tcp
> >>> > Bricks:
> >>> > Brick1: 192.168.1.246:/zp1/brick1
> >>> > Brick2: 192.168.1.246:/zp2/brick2
> >>> > Options Reconfigured:
> >>> > storage.health-check-interval: 30
> >>> >
> >>> > [root@fractal-c92e gluster-3.6]# zpool status zp2
> >>> >   pool: zp2
> >>> >  state: UNAVAIL
> >>> > status: One or more devices are faulted in response to IO failures.
> >>> > action: Make sure the affected devices are connected, then run 'zpool clear'.
> >>> >    see: http://zfsonlinux.org/msg/ZFS-8000-HC
> >>> >   scan: none requested
> >>> > config:
> >>> >
> >>> >         NAME        STATE     READ WRITE CKSUM
> >>> >         zp2         UNAVAIL      0     0     0  insufficient replicas
> >>> >           sdb       UNAVAIL      0     0     0
> >>> >
> >>> > errors: 2 data errors, use '-v' for a list
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Kiran.
> >>>
> >>> > _______________________________________________
> >>> > Gluster-devel mailing list
> >>> > Gluster-devel@xxxxxxxxxxx
> >>> > http://supercolony.gluster.org/mailman/listinfo/gluster-devel
> >>>
> >>>
> >>
> >
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel