Re: Self-heal still not finished after 2 days

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Mon, 30 Jun 2014 07:28:59 +0530

Oops, I see you are the same user who posted about VM files self-heal.
Sorry I couldn't get back to you in time. So you are using 3.4.2.
Could you please post the log files of the mount and the bricks? That
should help us find more information about any issues.

"gluster volume heal <volname> info heal-failed" records the last 1024
failures, along with the timestamp of when each failure occurred. It
keeps showing those entries even after the heal has succeeded, so the
timestamp of each failure is important. Because some of these commands
cause such confusion, we have deprecated them in the upcoming
releases (3.6).
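
To separate stale entries from recent ones, something like the sketch
below could be used. It assumes each failure line in the saved command
output starts with a "YYYY-MM-DD HH:MM:SS" timestamp followed by the
path; that layout is an assumption, so please check it against what
your version actually prints. The function name and the sample lines
are made up for illustration.

```python
from datetime import datetime


def failures_since(lines, cutoff):
    """Return (timestamp, path) pairs for failures at or after `cutoff`.

    Assumed line layout (not verified against 3.4.2 output):
        2014-06-29 12:47:28 /some/path/on/brick
    Lines that do not start with such a timestamp are skipped.
    """
    recent = []
    for line in lines:
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1],
                                      "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # header or brick line, not a failure entry
        if stamp >= cutoff:
            recent.append((stamp, parts[2]))
    return recent


# Hypothetical sample, not real output from this volume.
sample = [
    "Brick server1:/bricks/b1",
    "Number of entries: 2",
    "2014-06-27 10:00:00 /images/old-failure",
    "2014-06-29 12:47:28 /images/recent-failure",
]
recent = failures_since(sample, datetime(2014, 6, 28))
```

Only entries newer than the time the heal was restarted are worth
looking at; the older ones may already have been healed.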

This is probably a stupid question but let me ask it anyway. When a
brick's contents are erased from the backend, we need to make sure of
the following two things:
1) The extended attributes of the brick root directory show pending
operations on the brick that was erased
2) Execute "gluster volume heal <volname> full"

Did you do the steps above?

Since you are on 3.4.2, I think the best way to check which files have
been healed is to use the extended attributes in the backend. Could you
please post them again?
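
In case it helps when reading those attributes: the values can be
dumped in hex with "getfattr -d -m . -e hex <path-on-brick>". The
sketch below decodes one trusted.afr.* value, on the understanding that
it is 12 bytes holding three big-endian 32-bit counters (data,
metadata, entry). The function names and the sample values are
illustrative only, not taken from your bricks.

```python
import struct


def decode_afr_pending(hex_value):
    """Decode a trusted.afr.* changelog value into its three counters.

    `hex_value` is the string printed by getfattr with -e hex,
    for example "0x000000020000000000000000".
    """
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


def has_pending(hex_value):
    """True if any counter is non-zero, i.e. a heal is still pending."""
    return any(decode_afr_pending(hex_value).values())


# Illustrative values only.
pending = decode_afr_pending("0x000000020000000000000001")
clean = has_pending("0x000000000000000000000000")
```

All-zero counters on a file on both bricks would suggest that nothing
is pending for it; non-zero counters on the good brick pointing at the
other brick mean the file still needs healing.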

Pranith

On 06/30/2014 07:12 AM, Pranith Kumar Karampuri wrote:

On 06/30/2014 04:03 AM, John Gardeniers wrote:
Hi All,

We have 2 servers, each with one 5TB brick, configured as replica 2.
After a series of events that caused the 2 bricks to become way out of
step gluster was turned off on one server and its brick was wiped of
everything but the attributes were untouched.

This weekend we stopped the client and gluster and made a backup of the
remaining brick, just to play safe. Gluster was then turned back on,
first on the "master" and then on the "slave". Self-heal kicked in and
started rebuilding the second brick. However, after 2 full days all
files in the volume are still showing heal failed errors.

The rebuild was, in my opinion at least, very slow, taking most of a day
even though the system is on a 10Gb LAN. The data is a little under
1.4TB committed, 2TB allocated.
How much more is left to be healed? 0.6TB?

Once the 2 bricks were very close to having the same amount of space
used things slowed right down. For the last day both bricks show a very
slow increase in used space, even though there are no changes being
written by the client. By slow I mean just a few KB per minute.
Is I/O still in progress on the mount? In 3.4.x, self-heal doesn't
happen on files that have I/O going on from the mounts, so that could
be the reason if I/O is going on.

The logs are confusing, to say the least. In
etc-glusterfs-glusterd.vol.log on both servers there are thousands of
entries such as (possibly because I was using watch to monitor self-heal
progress):

[2014-06-29 21:41:11.289742] I
[glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume]
0-management: Received heal vol req for volume gluster-rhev
What version of gluster are you using?
That timestamp is the latest on either server, that's about 9 hours ago
as I type this. I find that a bit disconcerting. I have requested volume
heal-failed info since then.

The brick log on the "master" server (the one from which we are
rebuilding the new brick) contains no entries since before the rebuild
started.

On the "slave" server the brick log shows a lot of entries such as:

[2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk]
0-gluster-rhev-marker: Numerical result out of range occurred while
creating symlinks
[2014-06-28 08:49:47.887382] I
[server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server:
10311315: REMOVEXATTR
/44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec 

(1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key  ==> (Numerical result out
of range)
CC Raghavendra, who knows about the marker translator.

Those entries are around the time the rebuild was starting. The final
entries in that same log (immediately after those listed above) are:

[2014-06-29 12:47:28.473999] I
[server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869:
INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file
or directory)
[2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk]
0-gluster-rhev-server: 2870: OPEN (null)
(c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)
These logs are harmless and were fixed in 3.5 I think. Are you on 3.4.x?

As I type it's 2014-06-30 08:31.

What do they mean and how can I rectify it?

regards,
John

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
