Re: Heal operation detail of EC volumes

Serkan Çoban <cobanserkan@xxxxxxxxx> · Thu, 1 Jun 2017 22:31:04 +0300



>Is it possible that this matches your observations ?
Yes that matches what I see. So 19 files is being in parallel by 19
SHD processes. I thought only one file is being healed at a time.
Then what is the meaning of disperse.shd-max-threads parameter? If I
set it to 2 then each SHD thread will heal two files at the same time?

>How many IOPS can handle your bricks ?
Bricks are 7200RPM NL-SAS disks. 70-80 random IOPS max. But write
pattern seems sequential, 30-40MB bulk writes every 4-5 seconds.
This is what iostat shows.

>Do you have a test environment where we could check all this ?
Not currently but will have in 4-5 weeks. New servers are arriving, I
will add this test to my notes.

> There's a feature to allow to configure the self-heal block size to optimize these cases. The feature is available on 3.11.
I did not see this in 3.11 release notes, what parameter name I should look for?


On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:
> Hi Serkan,
>
> On 30/05/17 10:22, Serkan Çoban wrote:
>>
>> Ok I understand that heal operation takes place on server side. In
>> this case I should see X KB
>>  out network traffic from 16 servers and 16X KB input traffic to the
>> failed brick server right? So that process will get 16 chunks
>> recalculate our chunk and write it to disk.
>
>
> That should be the normal operation for a single heal.
>
>> The problem is I am not seeing such kind of traffic on servers. In my
>> configuration (16+4 EC) I see 20 servers are all have 7-8MB outbound
>> traffic and none of them has more than 10MB incoming traffic.
>> Only heal operation is happening on cluster right now, no client/other
>> traffic. I see constant 7-8MB write to healing brick disk. So where is
>> the missing traffic?
>
>
> Not sure about your configuration, but probably you are seeing the result of
> having the SHD of each server doing heals. That would explain the network
> traffic you have.
>
> Suppose that all SHD but the one on the damaged brick are working. In this
> case 19 servers will peek 16 fragments each. This gives 19 * 16 = 304
> fragments to be requested. EC balances the reads among all available
> servers, and there's a chance (1/19) that a fragment is local to the server
> asking it. So we'll need a total of 304 - 304 / 19 = 288 network requests,
> 288 / 19 = 15.2 sent by each server.
>
> If we have a total of 288 requests, it means that each server will answer
> 288 / 19 = 15.2 requests. The net effect of all this is that each healthy
> server is sending 15.2*X bytes of data and each server is receiving 15.2*X
> bytes of data.
>
> Now we need to account for the writes to the damaged brick. We have 19
> simultaneous heals. This means that the damaged brick will receive 19*X
> bytes of data, and each healthy server will send X additional bytes of data.
>
> So:
>
> A healthy server receives 15.2*X bytes of data
> A healthy server sends 16.2*X bytes of data
> A damaged server receives 19*X bytes of data
> A damaged server sends few bytes of data (communication and synchronization
> overhead basically)
>
> As you can see, in this configuration each server has almost the same amount
> of inbound and outbound traffic. Only big difference is the damaged brick,
> that should receive a little more of traffic, but it should send much less.
>
> Is it possible that this matches your observations ?
>
> There's one more thing to consider here, and it's the apparent low
> throughput of self-heal. One possible thing to check is the small size and
> random behavior of the requests.
>
> Assuming that each request has a size of ~128 / 16 = 8KB, at a rate of ~8
> MB/s the servers are processing ~1000 IOPS. Since requests are going to 19
> different files, even if each file is accessed sequentially, the real effect
> will be like random access (some read-ahead on the filesystem can improve
> reads a bit, but writes won't benefit so much).
>
> How many IOPS can handle your bricks ?
>
> Do you have a test environment where we could check all this ? if possible
> it would be interesting to have only a single SHD (kill all SHD from all
> servers but one). In this situation, without client accesses, we should see
> the 16/1 ratio of reads vs writes on the network. We should also see a
> similar of even a little better speed because all reads and writes will be
> sequential, optimizing available IOPS.
>
> There's a feature to allow to configure the self-heal block size to optimize
> these cases. The feature is available on 3.11.
>
> Best regards,
>
> Xavi
>
>
>>
>> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <aspandey@xxxxxxxxxx>
>> wrote:
>>>
>>>
>>> When we say client side heal or server side heal, we basically talking
>>> about
>>> the side which "triggers" heal of a file.
>>>
>>> 1 - server side heal - shd scans indices and triggers heal
>>>
>>> 2 - client side heal - a fop finds that file needs heal and it triggers
>>> heal
>>> for that file.
>>>
>>> Now, what happens when heal gets triggered.
>>> In both  the cases following functions takes part -
>>>
>>> ec_heal => ec_heal_throttle=>ec_launch_heal
>>>
>>> Now ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap
>>> which
>>> calls ec_heal_do ) and put it into a queue.
>>> This happens on server and "syncenv" infrastructure which is nothing but
>>> a
>>> set of workers pick these tasks and execute it. That is when actual
>>> read/write for
>>> heal happens.
>>>
>>>
>>> ________________________________
>>> From: "Serkan Çoban" <cobanserkan@xxxxxxxxx>
>>> To: "Ashish Pandey" <aspandey@xxxxxxxxxx>
>>> Cc: "Gluster Users" <gluster-users@xxxxxxxxxxx>
>>> Sent: Monday, May 29, 2017 6:44:50 PM
>>> Subject: Re:  Heal operation detail of EC volumes
>>>
>>>
>>>>> Healing could be triggered by client side (access of file) or server
>>>>> side
>>>>> (shd).
>>>>> However, in both the cases actual heal starts from "ec_heal_do"
>>>>> function.
>>>
>>> If I do a recursive getfattr operation from clients, then all heal
>>> operation is done on clients right? Client read the chunks, calculate
>>> and write the missing chunk.
>>> And If I don't access files from client then SHD daemons will start
>>> heal and read,calculate,write the missing chunks right?
>>>
>>> In first case EC calculations takes places in client fuse process, in
>>> second case EC calculations will be made in SHD process right?
>>> Does brick process has any role in EC calculations?
>>>
>>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <aspandey@xxxxxxxxxx>
>>> wrote:
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: "Serkan Çoban" <cobanserkan@xxxxxxxxx>
>>>> To: "Gluster Users" <gluster-users@xxxxxxxxxxx>
>>>> Sent: Monday, May 29, 2017 5:13:06 PM
>>>> Subject:  Heal operation detail of EC volumes
>>>>
>>>> Hi,
>>>>
>>>> When a brick fails in EC, What is the healing read/write data path?
>>>> Which processes do the operations?
>>>>
>>>> Healing could be triggered by client side (access of file) or server
>>>> side
>>>> (shd).
>>>> However, in both the cases actual heal starts from "ec_heal_do"
>>>> function.
>>>>
>>>>
>>>> Assume a 2GB file is being healed in 16+4 EC configuration. I was
>>>> thinking that SHD deamon on failed brick host will read 2GB from
>>>> network and reconstruct its 100MB chunk and write it on to brick. Is
>>>> this right?
>>>>
>>>> You are correct about read/write.
>>>> The only point is that, SHD deamon on one of the good brick will pick
>>>> the
>>>> index entry and heal it.
>>>> SHD deamon scans the .glusterfs/index directory and heals the entries.
>>>> If
>>>> the brick went down while IO was going on, index will be present on
>>>> killed
>>>> brick also.
>>>> However, if a brick was down and then you started writing on a file then
>>>> in
>>>> this case index entry would not be present on killed brick.
>>>> So even after brick will be  UP, sdh on that brick will not be able to
>>>> find
>>>> it out this index. However, other bricks would have entries and shd on
>>>> that
>>>> brick will heal it.
>>>>
>>>> Note: I am considering each brick on different node.
>>>>
>>>> Ashish
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users@xxxxxxxxxxx
>>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users@xxxxxxxxxxx
>>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users@xxxxxxxxxxx
>> http://lists.gluster.org/mailman/listinfo/gluster-users
>>
>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users