Heal completed, but I will try this by simulating a disk failure in the cluster and reply to you. Thanks for the help.

On Thu, Aug 11, 2016 at 9:52 AM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
>
> On Fri, Aug 5, 2016 at 8:37 PM, Serkan Çoban <cobanserkan@xxxxxxxxx> wrote:
>>
>> Hi again,
>>
>> I am seeing the above situation in a production environment now.
>> One disk on one of my servers broke. I killed the brick process,
>> replaced the disk, mounted it, and then ran gluster v start force.
>>
>> For a 24-hour period after replacing the disk, the gluster v heal
>> info count shown below kept increasing, up to about 200,000:
>>
>> gluster v heal v0 info | grep "Number of entries" | grep -v "Number of entries: 0"
>> Number of entries: 205117
>> Number of entries: 205231
>> ...
>>
>> Over the next ~72 hours it decreased to 40K, and it is going very
>> slowly right now. What I am observing is a very, very slow heal speed.
>> There are no errors in the brick logs.
>> There was 900GB of data on the broken disk, and 96 hours after
>> replacing it I see only 200GB healed.
>> There are the warnings below in glustershd.log, but I think they are
>> harmless:
>>
>> W [ec_combine.c:866:ec_combine_check] 0-v0-disperse-56: Mismatching
>> xdata in answers of LOOKUP
>> W [ec_common.c:116:ec_check_status] 0-v0-disperse-56: Operation failed
>> on some subvolumes (up=FFFFF, mask=FFFFF, remaining=0, good=FFFF7, bad=8)
>> W [ec_common.c:71:ec_heal_report] 0-v0-disperse-56: Heal failed
>> [invalid argument]
>>
>> I tried turning on performance.client-io-threads, but it did not
>> change anything.
>> For 900GB of data it will take nearly 8 days to heal. What can I do?
>
> Sorry for the delay in response; do you still have this problem?
> You can trigger heals using the following command:
>
> find <dir-you-are-interested> -d -exec getfattr -h -n trusted.ec.heal {} \;
>
> If you have 10 top-level directories, maybe you can spawn 10 such processes.
>
>> Serkan
>>
>> On Fri, Apr 15, 2016 at 1:28 PM, Serkan Çoban <cobanserkan@xxxxxxxxx> wrote:
>> > The 100TB is newly created files written while the brick was down. I
>> > rethought the situation and realized that I reformatted all the bricks
>> > in case 1, so the write speed limit is 26 * 100MB/s (100MB/s per disk).
>> > In case 2 I reformatted just one brick, so the write speed is limited
>> > to 100MB/s for that single disk... I will repeat the tests using one
>> > brick in both cases, once with a reformat and once just killing the
>> > brick process...
>> > Thanks for the reply.
>> >
>> > On Fri, Apr 15, 2016 at 9:27 AM, Xavier Hernandez
>> > <xhernandez@xxxxxxxxxx> wrote:
>> >> Hi Serkan,
>> >>
>> >> sorry for the delay, I'm a bit busy lately.
>> >>
>> >> On 13/04/16 13:59, Serkan Çoban wrote:
>> >>>
>> >>> Hi Xavier,
>> >>>
>> >>> Can you help me with the issue below? How can I increase the disperse
>> >>> heal speed?
>> >>
>> >> It seems weird. Is there any related message in the logs?
>> >>
>> >> In this particular test, are the 100TB modified files or newly created
>> >> files while the brick was down?
>> >>
>> >> How many files have been modified?
>> >>
>> >>> Also, I would be grateful if you have detailed documentation about
>> >>> disperse heal: why does heal happen on a disperse volume, and how is
>> >>> it triggered? Which nodes participate in the heal process? Is there
>> >>> any client interaction?
>> >>
>> >> The heal process is basically the same as the one used for replicate.
>> >> There are two ways to trigger a self-heal:
>> >>
>> >> * when an inconsistency is detected, the client initiates a background
>> >>   self-heal of the inode
>> >>
>> >> * the self-heal daemon scans the lists of modified files created by
>> >>   the index xlator when a modification is made while some node is
>> >>   down. All these files are self-healed.
>> >>
>> >> Xavi
>> >>
>> >>>
>> >>> Serkan
>> >>>
>> >>> ---------- Forwarded message ----------
>> >>> From: Serkan Çoban <cobanserkan@xxxxxxxxx>
>> >>> Date: Fri, Apr 8, 2016 at 5:46 PM
>> >>> Subject: disperse heal speed up
>> >>> To: Gluster Users <gluster-users@xxxxxxxxxxx>
>> >>>
>> >>> Hi,
>> >>>
>> >>> I am testing the heal speed of a disperse volume, and what I see is
>> >>> 5-10MB/s per node.
>> >>> I increased disperse.background-heals to 32 and
>> >>> disperse.heal-wait-qlength to 256, but still no difference.
>> >>> One thing I noticed is that when I kill a brick process, reformat it
>> >>> and restart it, the heal speed is nearly 20x higher (200MB/s per node).
>> >>>
>> >>> But when I kill the brick, then write 100TB of data and start the
>> >>> brick afterwards, the heal is slow (5-10MB/s per node).
>> >>>
>> >>> What is the difference between the two scenarios? Why is one heal
>> >>> slow and the other fast? How can I increase the disperse heal speed?
>> >>> Should I increase the thread count to 128 or 256? I am on a 78x(16+4)
>> >>> disperse volume and my servers are pretty strong (2x14 cores with
>> >>> 512GB RAM; each node has 26x8TB disks).
>> >>>
>> >>> Gluster version is 3.7.10.
>> >>>
>> >>> Thanks,
>> >>> Serkan
>
> --
> Pranith
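
Pranith's suggestion above, of spawning one heal-triggering process per top-level directory, could be scripted along the lines of the minimal sketch below. It assumes the v0 volume is FUSE-mounted on a client at /mnt/v0 (a hypothetical path; substitute the real mount point):

    #!/bin/bash
    # Hypothetical client-side mount point of the v0 volume; adjust as needed.
    MOUNT=/mnt/v0

    # Pranith's find/getfattr command, scoped to one top-level directory per
    # background job: fetching the trusted.ec.heal xattr is what triggers the
    # heal of each entry, and the jobs run in parallel.
    for dir in "$MOUNT"/*/; do
        find "$dir" -d -exec getfattr -h -n trusted.ec.heal {} \; >/dev/null 2>&1 &
    done

    # Block until every background heal-trigger job has finished.
    wait

Each background job is just the command Pranith gives, limited to one directory, so with 10 top-level directories this runs 10 heal triggers at once.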
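
For reference, the disperse.background-heals and disperse.heal-wait-qlength values Serkan mentions in the original message (and the performance.client-io-threads option he later tried) are volume options, presumably applied with gluster volume set along these lines:

    # Volume name v0 and the values are taken from the thread above.
    gluster volume set v0 disperse.background-heals 32
    gluster volume set v0 disperse.heal-wait-qlength 256
    gluster volume set v0 performance.client-io-threads on

As the thread notes, neither change made a noticeable difference to the heal throughput in this case.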