> Is it possible that this matches your observations ?
Yes, that matches what I see. So 19 files are being healed in parallel by
19 SHD processes. I thought only one file was being healed at a time.
Then what is the meaning of the disperse.shd-max-threads parameter? If I
set it to 2, will each SHD thread heal two files at the same time?

> How many IOPS can handle your bricks ?
Bricks are 7200RPM NL-SAS disks, 70-80 random IOPS max. But the write
pattern seems sequential: 30-40MB bulk writes every 4-5 seconds. This is
what iostat shows.

> Do you have a test environment where we could check all this ?
Not currently, but I will have one in 4-5 weeks. New servers are arriving;
I will add this test to my notes.

> There's a feature to allow to configure the self-heal block size to
> optimize these cases. The feature is available on 3.11.
I did not see this in the 3.11 release notes. What parameter name should I
look for?

On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
> Hi Serkan,
>
> On 30/05/17 10:22, Serkan Çoban wrote:
>>
>> Ok, I understand that the heal operation takes place on the server side.
>> In this case I should see X KB of outbound network traffic from 16
>> servers and 16X KB of inbound traffic to the failed brick server, right?
>> So that process will get 16 chunks, recalculate our chunk and write it
>> to disk.
>
>
> That should be the normal operation for a single heal.
>
>> The problem is I am not seeing that kind of traffic on the servers. In my
>> configuration (16+4 EC) I see that all 20 servers have 7-8MB outbound
>> traffic and none of them has more than 10MB incoming traffic.
>> Only the heal operation is happening on the cluster right now, no
>> client/other traffic. I see a constant 7-8MB write to the healing brick
>> disk. So where is the missing traffic?
>
>
> Not sure about your configuration, but probably you are seeing the result
> of having the SHD of each server doing heals. That would explain the
> network traffic you have.
>
> Suppose that all SHD but the one on the damaged brick are working. In this
> case 19 servers will pick 16 fragments each. This gives 19 * 16 = 304
> fragments to be requested. EC balances the reads among all available
> servers, and there's a chance (1/19) that a fragment is local to the server
> asking for it. So we'll need a total of 304 - 304 / 19 = 288 network
> requests, 288 / 19 = 15.2 sent by each server.
>
> If we have a total of 288 requests, it means that each server will answer
> 288 / 19 = 15.2 requests. The net effect of all this is that each healthy
> server is sending 15.2*X bytes of data and each server is receiving 15.2*X
> bytes of data.
>
> Now we need to account for the writes to the damaged brick. We have 19
> simultaneous heals. This means that the damaged brick will receive 19*X
> bytes of data, and each healthy server will send X additional bytes of data.
>
> So:
>
> A healthy server receives 15.2*X bytes of data
> A healthy server sends 16.2*X bytes of data
> A damaged server receives 19*X bytes of data
> A damaged server sends few bytes of data (communication and synchronization
> overhead basically)
>
> As you can see, in this configuration each server has almost the same amount
> of inbound and outbound traffic. The only big difference is the damaged
> brick, which should receive a little more traffic but send much less.
>
> Is it possible that this matches your observations ?
>
> There's one more thing to consider here, and it's the apparent low
> throughput of self-heal. One possible thing to check is the small size and
> random behavior of the requests.
>
> Assuming that each request has a size of ~128KB / 16 = 8KB, at a rate of
> ~8 MB/s the servers are processing ~1000 IOPS. Since requests are going to
> 19 different files, even if each file is accessed sequentially, the real
> effect will be like random access (some read-ahead on the filesystem can
> improve reads a bit, but writes won't benefit so much).
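
The numbers above can be reproduced with a short Python sketch. It is only a
back-of-the-envelope model, not anything taken from the Gluster code: the
function names are made up for illustration, X is treated as one unit, and
the 128KB block size and decimal megabytes are assumptions read off the
estimate above.

```python
# Rough model of the heal traffic described above (16+4 EC, one damaged brick).
# All names are illustrative; traffic is expressed in units of X (one fragment).

DATA_FRAGMENTS = 16    # fragments needed to rebuild a block (the "16" in 16+4)
HEALTHY_SERVERS = 19   # 20 bricks minus the damaged one, each running an SHD


def heal_traffic(data_fragments=DATA_FRAGMENTS, healthy=HEALTHY_SERVERS):
    """Return per-server traffic, in units of X, with one heal per healthy SHD."""
    total_reads = healthy * data_fragments       # 19 * 16 = 304 fragments
    local_reads = total_reads / healthy          # 1/19 of the reads are local
    network_reads = total_reads - local_reads    # 288 reads cross the network
    per_server = network_reads / healthy         # ~15.2 requested and answered
    return {
        "network_reads": network_reads,
        "healthy_receives": per_server,
        "healthy_sends": per_server + 1,         # plus one write to the damaged brick
        "damaged_receives": float(healthy),      # 19 simultaneous heals
    }


def estimated_iops(throughput_mb_s=8, block_kb=128, data_fragments=DATA_FRAGMENTS):
    """Requests per second if each request carries block_kb / data_fragments KB.

    Assumes decimal units (1 MB = 1000 KB) and a 128KB heal block.
    """
    request_kb = block_kb / data_fragments       # 128 / 16 = 8 KB per request
    return throughput_mb_s * 1000 / request_kb


if __name__ == "__main__":
    for name, value in heal_traffic().items():
        print(f"{name}: {value:.1f} X")
    print(f"estimated IOPS: {estimated_iops():.0f}")
```

Changing `healthy` to 1 gives the single-SHD case, where the one healing
server reads 16 fragments (one of them possibly local) for each fragment
written.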
>
> How many IOPS can handle your bricks ?
>
> Do you have a test environment where we could check all this ? If possible,
> it would be interesting to have only a single SHD (kill all SHD from all
> servers but one). In this situation, without client accesses, we should see
> the 16/1 ratio of reads vs writes on the network. We should also see a
> similar or even a little better speed because all reads and writes will be
> sequential, optimizing available IOPS.
>
> There's a feature to allow to configure the self-heal block size to optimize
> these cases. The feature is available on 3.11.
>
> Best regards,
>
> Xavi
>
>
>>
>> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <aspandey@xxxxxxxxxx> wrote:
>>>
>>>
>>> When we say client side heal or server side heal, we are basically
>>> talking about the side which "triggers" the heal of a file.
>>>
>>> 1 - server side heal - shd scans indices and triggers heal
>>>
>>> 2 - client side heal - a fop finds that a file needs heal and it triggers
>>> heal for that file.
>>>
>>> Now, what happens when heal gets triggered?
>>> In both cases the following functions take part -
>>>
>>> ec_heal => ec_heal_throttle => ec_launch_heal
>>>
>>> Now ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap,
>>> which calls ec_heal_do) and puts them into a queue.
>>> This happens on the server; the "syncenv" infrastructure, which is
>>> nothing but a set of workers, picks these tasks and executes them. That
>>> is when the actual read/write for heal happens.
>>>
>>>
>>> ________________________________
>>> From: "Serkan Çoban" <cobanserkan@xxxxxxxxx>
>>> To: "Ashish Pandey" <aspandey@xxxxxxxxxx>
>>> Cc: "Gluster Users" <gluster-users@xxxxxxxxxxx>
>>> Sent: Monday, May 29, 2017 6:44:50 PM
>>> Subject: Re: Heal operation detail of EC volumes
>>>
>>>
>>>>> Healing could be triggered by client side (access of file) or server
>>>>> side (shd).
>>>>> However, in both the cases actual heal starts from "ec_heal_do"
>>>>> function.
>>>
>>> If I do a recursive getfattr operation from clients, then all heal
>>> operations are done on clients, right? The client reads the chunks,
>>> calculates and writes the missing chunk.
>>> And if I don't access files from a client, then SHD daemons will start
>>> the heal and read, calculate and write the missing chunks, right?
>>>
>>> In the first case EC calculations take place in the client fuse process;
>>> in the second case EC calculations will be made in the SHD process, right?
>>> Does the brick process have any role in EC calculations?
>>>
>>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <aspandey@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: "Serkan Çoban" <cobanserkan@xxxxxxxxx>
>>>> To: "Gluster Users" <gluster-users@xxxxxxxxxxx>
>>>> Sent: Monday, May 29, 2017 5:13:06 PM
>>>> Subject: Heal operation detail of EC volumes
>>>>
>>>> Hi,
>>>>
>>>> When a brick fails in EC, what is the healing read/write data path?
>>>> Which processes do the operations?
>>>>
>>>> Healing could be triggered by client side (access of file) or server
>>>> side (shd).
>>>> However, in both the cases actual heal starts from "ec_heal_do"
>>>> function.
>>>>
>>>>
>>>> Assume a 2GB file is being healed in a 16+4 EC configuration. I was
>>>> thinking that the SHD daemon on the failed brick host will read 2GB from
>>>> the network, reconstruct its 100MB chunk and write it on to the brick.
>>>> Is this right?
>>>>
>>>> You are correct about read/write.
>>>> The only point is that the SHD daemon on one of the good bricks will
>>>> pick the index entry and heal it.
>>>> The SHD daemon scans the .glusterfs/index directory and heals the
>>>> entries. If the brick went down while IO was going on, the index will be
>>>> present on the killed brick also.
>>>> However, if a brick was down and then you started writing on a file, then
>>>> in this case the index entry would not be present on the killed brick.
>>>> So even after the brick is UP, shd on that brick will not be able to
>>>> find this index. However, other bricks would have entries and shd on
>>>> one of those bricks will heal it.
>>>>
>>>> Note: I am considering each brick on a different node.
>>>>
>>>> Ashish
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users@xxxxxxxxxxx
>>>> http://lists.gluster.org/mailman/listinfo/gluster-users