Re: Unavailability during self-heal for large volumes

----- Original Message -----
> From: "Laurent Chouinard" <laurent.chouinard@xxxxxxxxxxx>
> To: gluster-users@xxxxxxxxxxx
> Sent: Thursday, May 22, 2014 9:16:01 PM
> Subject:  Unavailability during self-heal for large volumes
> 
> Hi,
> 
> Digging in the archives of this list and Bugzilla, it seems that the problem
> I’m about to describe has existed for a long time. However, I am unclear
> whether a solution was ever found, so I’d like to get some input from the
> users mailing list.
> 
> For a volume with a very large number of files (several million), following
> an outage of a node, or if we replace a brick and present it empty to the
> cluster, the self-heal system kicks in, which is the expected behaviour.
> 
> However, during this self-heal, system load is so high that it renders the
> machine unavailable for several hours until it’s complete. On certain
> extreme occasions, it goes so far as to prevent SSH login, and at some point
> we even had to force a reboot to recover a minimum of usability.
> 
> Has anyone found a way to throttle the load of the self-heal system to a
> more acceptable level? My understanding is that the issue is caused by the
> very large number of IOPS required on every brick to enumerate all files and
> read metadata flags, then copy data and write changes. The machines are
> quite capable of heavy I/O, since the disks are all SSDs in RAID 0 and
> multiple network links are bonded per machine for extra bandwidth.
> 
> I don’t mind the time it takes to heal; I mind the impact healing has on
> other operations.
> 
> Any ideas?

Laurent,
   This has been improved significantly in afr-v2 (the enhanced version of the replication translator in gluster), which I believe will be released with 3.6. The issue happens because of the directory self-heal in older versions. In the new version, files in a directory are healed one at a time instead of the whole directory being healed at once, which was generating a lot of traffic. Unfortunately, this is too big a change to backport to older releases :-(.
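
In the meantime, a few volume options are sometimes tuned to soften the foreground impact of healing on current releases. This is only a sketch, not a recommendation: please verify the option names with 'gluster volume set help' on your release, treat the values as examples, and note that "myvol" below is a placeholder volume name.

    # Copy only changed blocks instead of whole files during data heal.
    gluster volume set myvol cluster.data-self-heal-algorithm diff

    # Limit how many files are healed in the background at the same time.
    gluster volume set myvol cluster.background-self-heal-count 8

    # Reduce the number of blocks healed per cycle for a single file.
    gluster volume set myvol cluster.self-heal-window-size 2

    # Optionally disable client-side (in-line) heals so that only the
    # self-heal daemon does the work, keeping client I/O more responsive.
    gluster volume set myvol cluster.data-self-heal off
    gluster volume set myvol cluster.metadata-self-heal off
    gluster volume set myvol cluster.entry-self-heal off

    # Check what still needs healing without adding much load.
    gluster volume heal myvol info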

Pranith

> 
> Thanks
> 
> Laurent Chouinard
> 
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users




