Re: AFR Version used for self-heal

On 02/26/2016 10:02 AM, Kyle Maas wrote:
On 02/25/2016 08:20 PM, Ravishankar N wrote:
On 02/25/2016 11:36 PM, Kyle Maas wrote:
How can I tell what AFR version a cluster is using for self-heal?
If all your servers and clients are 3.7.8, then they are by default
running afr-v2. Afr-v2 was a rewrite of AFR that went in for 3.6, so
any gluster package from then on has this code; you don't need to
explicitly enable anything.
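If you want to double-check what is actually installed, the quickest
thing is to compare the package versions on every node and client.
A minimal sketch using the standard binaries (run on each machine):

    # On each server node -- all should report the same version:
    gluster --version
    glusterfsd --version

    # On each client machine:
    glusterfs --version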
That was what I thought until I ran across this IRC log where JoeJulian
asked if it was explicitly enabled:

https://irclog.perlgeek.de/gluster/2015-10-29

The reason I ask is that I have a two-node replicated 3.7.8 cluster (no
arbiters) whose locking behavior during self-heal looks very similar to
that of AFRv1 (it only heals one file at a time per self-heal daemon,
appears to lock the full inode while healing it instead of just ranges,
etc.).
  Both v1 and v2 use range locks while healing a given file, so clients
shouldn't block when heals happen. What is the problem you're facing?
Are your clients also at 3.7.8?
Primary symptoms are:

1. While a self-heal is running, only one file at a time is healed per
brick.  As I understand it, AFRv2 and up should allow for multiple files
to be healed concurrently or at least multiple ranges within a file,
particularly with io-thread-count set to >1.  During a self-heal,
neither I/O nor network is saturated, which leads me to believe that I'm
looking at a single synchronous self-healing process.
The self-heal daemon on each node processes one file at a time per replica, so in that sense it is serial. We are working on the multi-threaded self-heal patch (http://review.gluster.org/#/c/13329/) for parallel heals.
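In the meantime you can at least watch the backlog drain to gauge
progress. A minimal sketch, assuming the volume is called myvol
(substitute your own volume name):

    # List the entries each brick still needs to heal:
    gluster volume heal myvol info

    # Just the pending counts, which is quicker with a large backlog:
    gluster volume heal myvol statistics heal-count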

3. More troubling is that during a self-heal, clients cannot so much as
list the files on the volume until the self-heal is done.  No errors.
No timeouts.  They just freeze.  As soon as the self-heal is complete,
they unfreeze and list the contents.
I'm guessing http://review.gluster.org/#/c/13207/ would fix that. But as a workaround, can you see if `gluster vol set volname data-self-heal off` makes them more responsive?
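Spelled out, with myvol standing in for your volume name (volume get
should be available on 3.7):

    # Disable client-side data self-heal; the self-heal daemon
    # still heals in the background:
    gluster volume set myvol cluster.data-self-heal off

    # Verify the option took effect:
    gluster volume get myvol cluster.data-self-heal

    # Re-enable it once the heal backlog has drained:
    gluster volume set myvol cluster.data-self-heal on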

4. Any file access during a self-heal also freezes, just like a
directory listing, until the self-heal is done.
Ditto as above, please see if disabling client-side heal helps.
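If disabling data self-heal alone doesn't help, the metadata and entry
client-side heals can be switched off the same way (again, myvol is a
placeholder, and remember to turn them back on afterwards):

    gluster volume set myvol cluster.metadata-self-heal off
    gluster volume set myvol cluster.entry-self-heal off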

Regards,
Ravi

This wreaks havoc on users who have files open when one of the bricks
is rebooted and has to be healed, since with as much data as is stored
on this cluster, a self-heal can take almost 24 hours.

I experience the same problems even when no clients other than the
bricks themselves are mounting the volume, so yes, it happens with the
clients on 3.7.8 as well.

Warm Regards,
Kyle Maas



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users


