[Gluster-devel] heal-failed on 3.5.2

Hi all,

I had an instance of heal-failed today on a 3x2 replicated volume (17TB) with XFS bricks on Ubuntu 12.04, running Gluster 3.5.2.

Initially:

On the brick server, errors in /var/log/glusterfs/glustershd.log:
[2014-09-25 15:56:10.200387] E [afr-self-heal-common.c:1615:afr_sh_common_lookup_cbk] 0-sas03-replicate-0: Conflicting entries for /RdB2C_20140917.dat
[2014-09-25 15:56:10.653858] E [afr-self-heal-common.c:1615:afr_sh_common_lookup_cbk] 0-sas03-replicate-0: Conflicting entries for /RdB2C_20140917.dat
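
To see at a glance which paths the self-heal daemon is complaining about, the log lines above can be reduced to a list of distinct paths. This is only a sketch: `conflicting_paths` is a hypothetical helper name, and it assumes the message text is exactly "Conflicting entries for <path>" as in the lines quoted here.

```shell
# Sketch: list the distinct paths reported as "Conflicting entries" in a log.
# conflicting_paths is a hypothetical helper, not a gluster command.
conflicting_paths() {
    # $1: path to a glustershd.log-style file
    grep 'Conflicting entries for' "$1" \
        | sed 's/.*Conflicting entries for //' \
        | sort -u
}

# Usage (log path taken from the message above):
#   conflicting_paths /var/log/glusterfs/glustershd.log
```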

which says a file has conflicting entries. No split-brain was detected, but heal-failed entries were reported:
root@glusterprod001:~# gluster volume heal sas03 info heal-failed
Gathering list of heal failed entries on volume sas03 has been successful

Brick glusterprod001.shopzilla.laxhq:/brick03/gfs
Number of entries: 2
at                    path on brick
-----------------------------------
2014-09-25 15:56:10 /HypDataSata03/data/RdctB2C
2014-09-25 16:06:09 /HypDataSata03/data/RdctB2C

Brick glusterprod002.shopzilla.laxhq:/brick03/gfs
Number of entries: 1
at                    path on brick
-----------------------------------
2014-09-25 15:58:37 /HypDataSata03//data//RdctB2C

Brick glusterprod003.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Brick glusterprod004.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Brick glusterprod005.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Brick glusterprod006.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Note that it lists the directory as heal-failed instead of the file.
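
With six bricks, most of that output is "Number of entries: 0". A small filter can show only the bricks that actually have entries. This is a sketch that parses the text format shown above and nothing more; `heal_failed_counts` is a hypothetical helper name.

```shell
# Sketch: print "<brick> <count>" for every brick with a non-zero count.
# heal_failed_counts is a hypothetical helper; it only parses text on stdin.
heal_failed_counts() {
    awk '
        /^Brick /              { brick = $2 }
        /^Number of entries: / { if ($4 > 0) print brick, $4 }
    '
}

# Usage:
#   gluster volume heal sas03 info heal-failed | heal_failed_counts
```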

Gluster clients see an error on the file: it shows up as an invalid file when running ls against it.

Then:

I restarted glusterfs-server on both prod001 and prod002, as that is how I have resolved heal-failed entries before.

The output then became:

root@glusterprod001:~# gluster volume heal sas03 info heal-failed
Gathering list of heal failed entries on volume sas03 has been successful

Brick glusterprod001.shopzilla.laxhq:/brick03/gfs
Number of entries: 2
at                    path on brick
-----------------------------------
2014-09-25 16:17:51 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:17:53 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>

Brick glusterprod002.shopzilla.laxhq:/brick03/gfs
Number of entries: 3
at                    path on brick
-----------------------------------
2014-09-25 16:15:43 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:15:44 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:17:51 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>

It seems the directory path is now reported as a gfid.
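
A gfid can be mapped back to something on the brick, because each brick keeps an entry for it under .glusterfs, in subdirectories named after the first two and the next two hex characters of the gfid. A sketch of that mapping follows; `gfid_to_backend_path` is a hypothetical helper name, and the brick root is the one from the output above.

```shell
# Sketch: build the .glusterfs backend path for a gfid on a given brick.
# gfid_to_backend_path is a hypothetical helper, not a gluster command.
gfid_to_backend_path() (
    brick=$1    # brick root, e.g. /brick03/gfs
    gfid=$2     # gfid in dashed form, as printed by "heal info"
    a=$(printf '%s' "$gfid" | cut -c1-2)
    b=$(printf '%s' "$gfid" | cut -c3-4)
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$a" "$b" "$gfid"
)

# Usage (run on the brick server; for a directory the entry is a symlink,
# so ls -l shows where it points):
#   ls -l "$(gfid_to_backend_path /brick03/gfs 9ec801a3-53d4-4d98-b950-14211920694e)"
```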

And then:
I identified the file on the brick, removed the invalid copy, and then issued a volume heal:

# gluster volume heal sas03
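
For the "identify the invalid copy" step, one way to compare the copies is to dump the extended attributes of the file on each brick with `getfattr -d -m . -e hex <path-on-brick>` (as root) and compare the trusted.gfid values. My understanding, which is an assumption here, is that "Conflicting entries" usually means the replicas disagree on gfid or file type. The sketch below only parses getfattr output; `xattr_gfid` is a hypothetical helper name.

```shell
# Sketch: read `getfattr -d -m . -e hex <file>` output on stdin and print
# the trusted.gfid value in the dashed form used by "heal info".
# xattr_gfid is a hypothetical helper, not a gluster command.
xattr_gfid() {
    sed -n 's/^trusted\.gfid=0x//p' \
        | sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/'
}

# Usage (run on each brick server, then compare the results by eye;
# <path-on-brick> is a placeholder for the file's path under the brick root):
#   getfattr -d -m . -e hex <path-on-brick> | xattr_gfid
```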

Removing the invalid copy and running the heal fixed client access to the file,

but info heal-failed then showed this:
root@glusterprod001:/# gluster volume heal sas03 info heal-failed
Gathering list of heal failed entries on volume sas03 has been successful

Brick glusterprod001.shopzilla.laxhq:/brick03/gfs
Number of entries: 3
at                    path on brick
-----------------------------------
2014-09-25 16:17:51 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:17:53 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:27:53 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>

Brick glusterprod002.shopzilla.laxhq:/brick03/gfs
Number of entries: 5
at                    path on brick
-----------------------------------
2014-09-25 16:15:43 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:15:44 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:17:51 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:25:44 <gfid:9ec801a3-53d4-4d98-b950-14211920694e>
2014-09-25 16:34:11 /HypDataSata03/data/RdctB2C

Brick glusterprod003.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Brick glusterprod004.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Brick glusterprod005.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

Brick glusterprod006.shopzilla.laxhq:/brick03/gfs
Number of entries: 0

The list still has all the gfid entries, and the directory path has shown up again under heal-failed.

Finally:

I restarted glusterfs-server on both prod001 and prod002, and that cleared the heal-failed entries.

Is there a better way to resolve the heal-failed entries and the file conflict?

Thanks
Peter

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
