Re: self-heal not working

On 08/28/2017 01:57 AM, Ben Turner wrote:
----- Original Message -----
From: "mabi" <mabi@xxxxxxxxxxxxx>
To: "Ravishankar N" <ravishankar@xxxxxxxxxx>
Cc: "Ben Turner" <bturner@xxxxxxxxxx>, "Gluster Users" <gluster-users@xxxxxxxxxxx>
Sent: Sunday, August 27, 2017 3:15:33 PM
Subject: Re: self-heal not working

Thanks Ravi for your analysis. So as far as I understand there is nothing to
worry about, but my question now would be: how do I get rid of this file from
the heal info output?
Correct me if I am wrong, but clearing this is just a matter of resetting the afr.dirty xattr? @Ravi - Is this correct?

Yes, resetting the xattr and then launching an index heal (or running the heal-info command) should serve as a workaround.
-Ravi
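
For reference, a minimal sketch of that workaround (using the anonymized
names from this thread; since the original path was reported deleted, the
gfid hard link under the brick's .glusterfs directory is the likely target):

    # on each brick, reset the dirty xattr to all zeros
    setfattr -n trusted.afr.dirty -v 0x000000000000000000000000 \
        /data/myvolume/brick/.glusterfs/19/85/1985e233-d5ee-4e3e-a51a-cf0b5f9f2aea
    # then launch an index heal (or just run heal info)
    gluster volume heal myvolume
    gluster volume heal myvolume info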


-b

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 27, 2017 3:45 PM
UTC Time: August 27, 2017 1:45 PM
From: ravishankar@xxxxxxxxxx
To: mabi <mabi@xxxxxxxxxxxxx>
Ben Turner <bturner@xxxxxxxxxx>, Gluster Users <gluster-users@xxxxxxxxxxx>

Yes, the shds did pick up the file for healing (I saw messages like " got
entry: 1985e233-d5ee-4e3e-a51a-cf0b5f9f2aea") but no error afterwards.

Anyway I reproduced it by manually setting the afr.dirty bit for a zero
byte file on all 3 bricks. Since there are no afr pending xattrs
indicating good/bad copies and all files are zero bytes, the data
self-heal algorithm just picks the file with the latest ctime as source.
In your case that was the arbiter brick. In the code, there is a check to
prevent data heals if arbiter is the source. So heal was not happening and
the entries were not removed from heal-info output.
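
For anyone who wants to reproduce this, the manual step was along these
lines (brick path hypothetical; the value sets the data counter of the
dirty xattr to 1, matching the 0sAAAAAQAAAAAAAAAA seen further below):

    # create a zero-byte file through the mount, then on each of the 3 bricks:
    setfattr -n trusted.afr.dirty -v 0x000000010000000000000000 \
        /bricks/testvol/brick/testfile
    # heal info now lists the file, but data self-heal refuses to use the
    # arbiter as source, so the entry never clears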

Perhaps we should add a check in the code to just remove the entries from
heal-info if the size is zero bytes on all bricks.

-Ravi

On 08/25/2017 06:33 PM, mabi wrote:

Hi Ravi,

Did you get a chance to have a look at the log files I have attached in my
last mail?

Best,
Mabi

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 24, 2017 12:08 PM
UTC Time: August 24, 2017 10:08 AM
From: mabi@xxxxxxxxxxxxx
To: Ravishankar N <ravishankar@xxxxxxxxxx>
Ben Turner <bturner@xxxxxxxxxx>, Gluster Users <gluster-users@xxxxxxxxxxx>

Thanks for confirming the command. I have now enabled the DEBUG
client-log-level, run a heal, and attached the glustershd log files
of all 3 nodes to this mail.

The volume concerned is called myvol-pro; the other 3 volumes have had no
problems so far.

Also note that in the meantime it looks like the file has been deleted
by the user, and as such the heal info command does not show the file
name anymore but just its GFID, which is:

gfid:1985e233-d5ee-4e3e-a51a-cf0b5f9f2aea

Hope that helps for debugging this issue.
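
In case it is useful: a gfid reported by heal info can still be located on
a brick through its hard link under the brick's .glusterfs directory, e.g.
(assuming node1's brick layout from this thread):

    ls -l /data/myvolume/brick/.glusterfs/19/85/1985e233-d5ee-4e3e-a51a-cf0b5f9f2aea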

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 24, 2017 5:58 AM
UTC Time: August 24, 2017 3:58 AM
From: ravishankar@xxxxxxxxxx
To: mabi <mabi@xxxxxxxxxxxxx>
Ben Turner <bturner@xxxxxxxxxx>, Gluster Users <gluster-users@xxxxxxxxxxx>

Unlikely. In your case only the afr.dirty is set, not the
afr.volname-client-xx xattr.
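
Which afr xattrs are present can be checked directly on a brick with
something like:

    getfattr -d -m trusted.afr -e hex /data/myvolume/brick/<path to file>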

`gluster volume set myvolume diagnostics.client-log-level DEBUG` is
right.
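
(Remember to revert it after collecting the logs, since INFO is the default:)

    gluster volume set myvolume diagnostics.client-log-level INFO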

On 08/23/2017 10:31 PM, mabi wrote:

I just saw the following bug which was fixed in 3.8.15:

https://bugzilla.redhat.com/show_bug.cgi?id=1471613

Is it possible that the problem I described in this post is related to
that bug?

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 22, 2017 11:51 AM
UTC Time: August 22, 2017 9:51 AM
From: ravishankar@xxxxxxxxxx
To: mabi <mabi@xxxxxxxxxxxxx>
Ben Turner <bturner@xxxxxxxxxx>, Gluster Users <gluster-users@xxxxxxxxxxx>

On 08/22/2017 02:30 PM, mabi wrote:

Thanks for the additional hints. I have the following 2 questions first:

- In order to launch the index heal, is the following command correct:
gluster volume heal myvolume
Yes

- If I run a "volume start force" will it cause any short disruptions
for my clients which mount the volume through FUSE? If yes, how long?
This is a production system, which is why I am asking.
No. You can actually create a test volume on your personal Linux box
to try these kinds of things without needing multiple machines. This
is how we develop and test our patches :)
`gluster volume create testvol replica 3 /home/mabi/bricks/brick{1..3}
force` and so on.
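
A slightly fuller sketch of such a throwaway setup (note the CLI wants
bricks as host:path, so the hostname prefix here is an assumption about
the test box):

    mkdir -p /home/mabi/bricks/brick{1..3}
    gluster volume create testvol replica 3 \
        $(hostname):/home/mabi/bricks/brick{1..3} force
    gluster volume start testvol
    mkdir -p /mnt/testvol
    mount -t glusterfs $(hostname):/testvol /mnt/testvol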

HTH,
Ravi

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 22, 2017 6:26 AM
UTC Time: August 22, 2017 4:26 AM
From: ravishankar@xxxxxxxxxx
To: mabi <mabi@xxxxxxxxxxxxx>, Ben Turner <bturner@xxxxxxxxxx>
Gluster Users <gluster-users@xxxxxxxxxxx>

Explore the following:

- Launch index heal and look at the glustershd logs of all bricks
for possible errors

- See if the glustershd in each node is connected to all bricks.

- If not, try to restart shd with `volume start force`

- Launch index heal again and try.

- Try debugging the shd log by setting client-log-level to DEBUG
temporarily (a command sketch of these steps follows below).
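
A concrete sketch of those steps, using the volume name from this thread:

    gluster volume heal myvolume            # launch index heal
    gluster volume status myvolume          # is the Self-heal Daemon online on every node?
    gluster volume start myvolume force     # if not, this respawns shd; clients stay mounted
    gluster volume set myvolume diagnostics.client-log-level DEBUG
    gluster volume heal myvolume            # heal again, then inspect glustershd.log on each node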

On 08/22/2017 03:19 AM, mabi wrote:

Sure, it doesn't look like a split brain based on the output:

Brick node1.domain.tld:/data/myvolume/brick
Status: Connected
Number of entries in split-brain: 0

Brick node2.domain.tld:/data/myvolume/brick
Status: Connected
Number of entries in split-brain: 0

Brick node3.domain.tld:/srv/glusterfs/myvolume/brick
Status: Connected
Number of entries in split-brain: 0

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 21, 2017 11:35 PM
UTC Time: August 21, 2017 9:35 PM
From: bturner@xxxxxxxxxx
To: mabi <mabi@xxxxxxxxxxxxx>
Gluster Users <gluster-users@xxxxxxxxxxx>

Can you also provide:

gluster v heal <my vol> info split-brain

If it is split brain, just delete the incorrect file from the brick
and run heal again. I haven't tried this with arbiter but I
assume the process is the same.
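
Not needed here since the split-brain count turned out to be zero
everywhere, but for completeness the usual manual fix looks roughly like
this (placeholders; the gfid hard link must be removed as well):

    # on the brick holding the bad copy only:
    rm <bad copy on the offending brick>
    rm <its hard link under the brick's .glusterfs/xx/yy/ directory>
    gluster volume heal myvolume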

-b

----- Original Message -----
From: "mabi" [<mabi@xxxxxxxxxxxxx>](mailto:mabi@xxxxxxxxxxxxx)
To: "Ben Turner"
[<bturner@xxxxxxxxxx>](mailto:bturner@xxxxxxxxxx)
Cc: "Gluster Users"
[<gluster-users@xxxxxxxxxxx>](mailto:gluster-users@xxxxxxxxxxx)
Sent: Monday, August 21, 2017 4:55:59 PM
Subject: Re:  self-heal not working

Hi Ben,

So it is really a 0 kBytes file everywhere (all nodes including
the arbiter
and from the client).
Below you will find the output you requested. Hopefully that will help
to find out why this specific file is not healing... Let me know if you need
any more information. Btw, node3 is my arbiter node.

NODE1:

STAT:
File:
‘/data/myvolume/brick/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png’
Size: 0 Blocks: 38 IO Block: 131072 regular empty file
Device: 24h/36d Inode: 10033884 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 33/www-data) Gid: ( 33/www-data)
Access: 2017-08-14 17:04:55.530681000 +0200
Modify: 2017-08-14 17:11:46.407404779 +0200
Change: 2017-08-14 17:11:46.407404779 +0200
Birth: -

GETFATTR:
trusted.afr.dirty=0sAAAAAQAAAAAAAAAA
trusted.bit-rot.version=0sAgAAAAAAAABZhuknAAlJAg==
trusted.gfid=0sGYXiM9XuTj6lGs8LX58q6g==
trusted.glusterfs.d99af2fa-439b-4a21-bf3a-38f3849f87ec.xtime=0sWZG9sgAGOyo=

NODE2:

STAT:
File:
‘/data/myvolume/brick/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png’
Size: 0 Blocks: 38 IO Block: 131072 regular empty file
Device: 26h/38d Inode: 10031330 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 33/www-data) Gid: ( 33/www-data)
Access: 2017-08-14 17:04:55.530681000 +0200
Modify: 2017-08-14 17:11:46.403704181 +0200
Change: 2017-08-14 17:11:46.403704181 +0200
Birth: -

GETFATTR:
trusted.afr.dirty=0sAAAAAQAAAAAAAAAA
trusted.bit-rot.version=0sAgAAAAAAAABZhu6wAA8Hpw==
trusted.gfid=0sGYXiM9XuTj6lGs8LX58q6g==
trusted.glusterfs.d99af2fa-439b-4a21-bf3a-38f3849f87ec.xtime=0sWZG9sgAGOVE=

NODE3:
STAT:
File:
/srv/glusterfs/myvolume/brick/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: ca11h/51729d Inode: 405208959 Links: 2
Access: (0644/-rw-r--r--) Uid: ( 33/www-data) Gid: ( 33/www-data)
Access: 2017-08-14 17:04:55.530681000 +0200
Modify: 2017-08-14 17:04:55.530681000 +0200
Change: 2017-08-14 17:11:46.604380051 +0200
Birth: -

GETFATTR:
trusted.afr.dirty=0sAAAAAQAAAAAAAAAA
trusted.bit-rot.version=0sAgAAAAAAAABZe6ejAAKPAg==
trusted.gfid=0sGYXiM9XuTj6lGs8LX58q6g==
trusted.glusterfs.d99af2fa-439b-4a21-bf3a-38f3849f87ec.xtime=0sWZG9sgAGOc4=

CLIENT GLUSTER MOUNT:
STAT:
File:
"/mnt/myvolume/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png"
Size: 0 Blocks: 0 IO Block: 131072 regular empty file
Device: 1eh/30d Inode: 11897049013408443114 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 33/www-data) Gid: ( 33/www-data)
Access: 2017-08-14 17:04:55.530681000 +0200
Modify: 2017-08-14 17:11:46.407404779 +0200
Change: 2017-08-14 17:11:46.407404779 +0200
Birth: -
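
For what it's worth, the trusted.afr.dirty value is identical on all three
bricks and can be decoded like this (the 0s prefix in getfattr output means
base64):

    echo AAAAAQAAAAAAAAAA | base64 -d | xxd
    # 00000000: 0000 0001 0000 0000 0000 0000
    # three 32-bit counters (data/metadata/entry); data=1 marks the file dirty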

-------- Original Message --------
Subject: Re: self-heal not working
Local Time: August 21, 2017 9:34 PM
UTC Time: August 21, 2017 7:34 PM
From: bturner@xxxxxxxxxx
To: mabi <mabi@xxxxxxxxxxxxx>
Gluster Users <gluster-users@xxxxxxxxxxx>

----- Original Message -----
From: "mabi" [<mabi@xxxxxxxxxxxxx>](mailto:mabi@xxxxxxxxxxxxx)
To: "Gluster Users"
[<gluster-users@xxxxxxxxxxx>](mailto:gluster-users@xxxxxxxxxxx)
Sent: Monday, August 21, 2017 9:28:24 AM
Subject:  self-heal not working

Hi,

I have a replica 2 with arbiter GlusterFS 3.8.11 cluster and there is
currently one file listed to be healed, as you can see below, but it never
gets healed by the self-heal daemon:

Brick node1.domain.tld:/data/myvolume/brick
/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png
Status: Connected
Number of entries: 1

Brick node2.domain.tld:/data/myvolume/brick
/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png
Status: Connected
Number of entries: 1

Brick node3.domain.tld:/srv/glusterfs/myvolume/brick
/data/appdata_ocpom4nckwru/preview/1344699/64-64-crop.png
Status: Connected
Number of entries: 1

As once recommended on this mailing list, I temporarily mounted that
glusterfs volume through fuse/glusterfs and ran a "stat" on the file
listed above, but nothing happened.

The file itself is available on all 3 nodes/bricks, but on the
last node it has a different date. By the way, this file is 0 kBytes in
size. Is that maybe the reason why the self-heal does not work?
Is the file actually 0 bytes, or is it just 0 bytes on the
arbiter? (0 bytes are expected on the arbiter, it just stores metadata.)
Can you send us the output from stat on all 3 nodes:

$ stat <file on back end brick>
$ getfattr -d -m - <file on back end brick>
$ stat <file from gluster mount>

Let's see what things look like on the back end; it should tell us why
healing is failing.

-b

And how can I now make this file heal?

Thanks,
Mabi




_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



