Hello,

On 03/23/2016 06:35 PM, Ravishankar N wrote:
> On 03/23/2016 09:53 PM, Marian Marinov wrote:
>>> > What version of gluster is this?
>> 3.7.6
>>
>>> > Do you observe the problem even when only the 4th 'non data' server
>>> comes up? In that case it is unlikely that self-heal is the issue.
>> No
>>
>>> > Are the clients using FUSE or NFS mounts?
>> FUSE
>>
>
> Okay, when you say the cluster stalls, I'm assuming the apps using
> files via the fuse mount are stalled. Does the mount log contain
> messages about completing selfheals on files when the mount eventually
> becomes responsive? If yes, you could try setting
> 'cluster.data-self-heal' to off.

Yes, we have many lines with similar entries in the logs:

[2016-03-22 11:10:23.398668] I [MSGID: 108026] [afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0: Completed data selfheal on b18c2b05-7186-4c22-ab34-24858b1153e5. source=0 sinks=2
[2016-03-23 13:11:54.110773] I [MSGID: 108026] [afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0: Completed metadata selfheal on 591d2bee-b55c-4dd6-a1bc-8b7fc5571caa. source=0 sinks=

We already tested setting cluster.self-heal-daemon to off, and we did not experience the issue in that case. We stopped one node, disabled the self-heal-daemon, started the node, and later re-enabled the self-heal-daemon. There was no "stalling" in this case.

We will try the suggested setting too.
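For reference, this is roughly what we ran during the earlier test and what we plan to run now (assuming the volume is named "share", as the "0-share-replicate-0" prefix in the log entries suggests):

    # earlier test: before bringing the stopped node back up
    gluster volume set share cluster.self-heal-daemon off
    # ... node started and came back online, then re-enabled
    gluster volume set share cluster.self-heal-daemon on

    # suggested setting: disable client-side data self-heal
    gluster volume set share cluster.data-self-heal off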
--
Dimitar Ianakiev
System Administrator
www.siteground.com