Apologies for the format below; I've been receiving emails in digest format, so I had to copy/paste the email below:
We have been observing the same issue with glusterfs-3.7.8-1.el7.x86_64, but our investigation shows the following:
- Even after SHD (the self-heal daemon) is turned off, the auto-heal process still causes the same amount of CPU activity.
- This happens with NFS mounts as well; we've tried both FUSE and NFS v3.
- It gets triggered when adding new nodes/bricks to replicated (or replicated+distributed) volumes.
- It gets especially bad when auto-heal (or SHD, for that matter) dives into a directory with 300K+ files in it.
- We've used strace/sysdig and other methods; the culprit appears to be hard-coded SHD or auto-heal thread counts. We've observed that SHD uses 4 threads per brick, and we couldn't find any configuration option to reduce the thread count.
- We've tried CPU pinning and other measures; this didn't fix the root cause, but it helped reduce the CPU load on our servers (see the sketch after this list). Without CPU pinning, the load goes as high as 100 on 16-core/32-thread (HT) servers.
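For reference, the pinning we used was roughly along these lines. This is only a sketch: the core range 0-3 is illustrative, and 'pgrep -f glustershd' assumes the self-heal daemon is identifiable by "glustershd" on its command line.

    # Pin every self-heal daemon process to a small set of cores so the
    # heal traversal cannot saturate the whole box.
    # The core range 0-3 is only an example; pick cores away from your
    # latency-sensitive workload.
    for pid in $(pgrep -f glustershd); do
        taskset -cp 0-3 "$pid"
    done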
Please let us know if this seems like a bug that needs to be filed.
Thank you
Hello,

On 03/23/2016 06:35 PM, Ravishankar N wrote:
> On 03/23/2016 09:53 PM, Marian Marinov wrote:
>>> What version of gluster is this?
>> 3.7.6
>>
>>> Do you observe the problem even when only the 4th 'non data' server comes up? In that case it is unlikely that self-heal is the issue.
>> No
>>
>>> Are the clients using FUSE or NFS mounts?
>> FUSE
>>
> Okay, when you say the cluster stalls, I'm assuming the apps using files via the fuse mount are stalled. Does the mount log contain messages about completing selfheals on files when the mount eventually becomes responsive? If yes, you could try setting 'cluster.data-self-heal' to off.

Yes, we have many lines with similar entries in the logs:

[2016-03-22 11:10:23.398668] I [MSGID: 108026] [afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0: Completed data selfheal on b18c2b05-7186-4c22-ab34-24858b1153e5. source=0 sinks=2
[2016-03-23 13:11:54.110773] I [MSGID: 108026] [afr-self-heal-common.c:651:afr_log_selfheal] 0-share-replicate-0: Completed metadata selfheal on 591d2bee-b55c-4dd6-a1bc-8b7fc5571caa. source=0 sinks=

We already tested setting cluster.self-heal-daemon to off, and we did not experience the issue in that case. We stopped one node, disabled the self-heal daemon, started the node, and enabled the self-heal daemon again later. There was no "stalling" in this case. We will try the suggested setting too.

--
Dimitar Ianakiev
System Administrator
www.siteground.com
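For reference, the settings discussed in the pasted thread are applied with 'gluster volume set'. A rough sketch only; the volume name "share" is simply taken from the "0-share-replicate-0" prefix in the logs above, so substitute your own volume name.

    # Suggestion from the thread: disable client-side data self-heal
    gluster volume set share cluster.data-self-heal off

    # What was tested earlier: turn the self-heal daemon off, then back on
    gluster volume set share cluster.self-heal-daemon off
    gluster volume set share cluster.self-heal-daemon on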
Kayra Otaner
BilgiO A.Ş. - SecOps Experts
PGP KeyID : A945251E | Manager, Enterprise Linux Solutions
www.bilgio.com | TR +90 (532) 111-7240 x 1001 | US +1 (201) 206-2592