Re: Heal operation detail of EC volumes

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Thu, 8 Jun 2017 12:49:40 +0530

On Fri, Jun 2, 2017 at 1:01 AM, Serkan Çoban <cobanserkan@xxxxxxxxx> wrote:
>Is it possible that this matches your observations ?

Yes, that matches what I see. So 19 files are being healed in parallel by 19

SHD processes. I thought only one file is being healed at a time.

Then what is the meaning of disperse.shd-max-threads parameter? If I

set it to 2 then each SHD thread will heal two files at the same time?

Yes that is the idea.
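
A minimal sketch of the arithmetic behind this answer, assuming one brick per server and that every active SHD always has entries to heal. The function name is hypothetical, and the default of 1 simply mirrors the one-file-per-SHD behavior observed in this thread:

```python
def concurrent_heals(active_shds: int, shd_max_threads: int = 1) -> int:
    """Estimate how many files are healed at the same time.

    Assumption: each active SHD heals up to `shd_max_threads` files in
    parallel (the value of disperse.shd-max-threads).
    """
    if active_shds < 0 or shd_max_threads < 1:
        raise ValueError("need active_shds >= 0 and shd_max_threads >= 1")
    return active_shds * shd_max_threads


print(concurrent_heals(19))      # 19 SHDs, one file each
print(concurrent_heals(19, 2))   # 19 SHDs, two files each
```

Under these assumptions, setting disperse.shd-max-threads to 2 would double the number of files healed in parallel, from 19 to 38.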

>How many IOPS can your bricks handle?

Bricks are 7200RPM NL-SAS disks. 70-80 random IOPS max. But write

pattern seems sequential, 30-40MB bulk writes every 4-5 seconds.

This is what iostat shows.

>Do you have a test environment where we could check all this ?

Not currently but will have in 4-5 weeks. New servers are arriving, I

will add this test to my notes.

> There's a feature that allows configuring the self-heal block size to optimize these cases. The feature is available on 3.11.

I did not see this in the 3.11 release notes. What parameter name should I look for?

disperse.self-heal-window-size

+    { .key  = {"self-heal-window-size"},
+        .type = GF_OPTION_TYPE_INT,
+        .min  = 1,
+        .max  = 1024,
+        .default_value = "1",
+        .description = "Maximum number blocks(128KB) per file for which "
+                       "self-heal process would be applied simultaneously."
+    },

This is the patch: https://review.gluster.org/17098
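
A small sketch of what the option's limits imply, assuming the window is simply the option value times the 128KB block size quoted in the description. The helper names and the volume name are hypothetical; the command built here follows the usual `gluster volume set <VOLNAME> <KEY> <VALUE>` form and is only constructed, not executed:

```python
from typing import List

BLOCK_SIZE_KB = 128                 # block size quoted in the option description
WINDOW_MIN, WINDOW_MAX = 1, 1024    # .min and .max from the patch above


def heal_window_kb(window_size: int) -> int:
    """Return the amount of data (KB) per file covered by one heal window."""
    if not WINDOW_MIN <= window_size <= WINDOW_MAX:
        raise ValueError(
            f"window size must be between {WINDOW_MIN} and {WINDOW_MAX}"
        )
    return window_size * BLOCK_SIZE_KB


def volume_set_command(volname: str, window_size: int) -> List[str]:
    """Build (but do not run) the CLI call that would set the option."""
    heal_window_kb(window_size)     # reuse the range check
    return [
        "gluster", "volume", "set", volname,
        "disperse.self-heal-window-size", str(window_size),
    ]


print(heal_window_kb(1))                      # default window: a single block
print(" ".join(volume_set_command("myvol", 8)))
```

With the default of 1 the window covers a single 128KB block; at the maximum of 1024 it would cover 128MB per file.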

+Sunil,
     Could you add this to release notes please?

On Thu, Jun 1, 2017 at 10:30 AM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

> Hi Serkan,

>

> On 30/05/17 10:22, Serkan Çoban wrote:

>>

>> OK, I understand that the heal operation takes place on the server side. In

>> this case I should see X KB of outbound network traffic from each of 16

>> servers and 16X KB of inbound traffic to the failed brick server, right?

>> So that process will get 16 chunks, recalculate its chunk and write it

>> to disk.

>

>

> That should be the normal operation for a single heal.

>

>> The problem is I am not seeing such kind of traffic on servers. In my

>> configuration (16+4 EC) I see that all 20 servers have 7-8MB outbound

>> traffic and none of them has more than 10MB incoming traffic.

>> Only heal operation is happening on cluster right now, no client/other

>> traffic. I see constant 7-8MB write to healing brick disk. So where is

>> the missing traffic?

>

>

> Not sure about your configuration, but probably you are seeing the result of

> having the SHD of each server doing heals. That would explain the network

> traffic you have.

>

> Suppose that all SHD but the one on the damaged brick are working. In this

> case 19 servers will read 16 fragments each. This gives 19 * 16 = 304

> fragments to be requested. EC balances the reads among all available

> servers, and there's a chance (1/19) that a fragment is local to the server

> asking it. So we'll need a total of 304 - 304 / 19 = 288 network requests,

> 288 / 19 = 15.2 sent by each server.

>

> If we have a total of 288 requests, it means that each server will answer

> 288 / 19 = 15.2 requests. The net effect of all this is that each healthy

> server is sending 15.2*X bytes of data and each server is receiving 15.2*X

> bytes of data.

>

> Now we need to account for the writes to the damaged brick. We have 19

> simultaneous heals. This means that the damaged brick will receive 19*X

> bytes of data, and each healthy server will send X additional bytes of data.

>

> So:

>

> A healthy server receives 15.2*X bytes of data

> A healthy server sends 16.2*X bytes of data

> A damaged server receives 19*X bytes of data

> A damaged server sends few bytes of data (communication and synchronization

> overhead basically)

>

> As you can see, in this configuration each server has almost the same amount

> of inbound and outbound traffic. The only big difference is the damaged brick,

> which should receive a little more traffic but send much less.
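
The estimate above can be reproduced with a short sketch. It assumes, as in the explanation, that every healthy server runs one heal at a time, that reads are spread evenly over the healthy servers, and that all results are expressed in units of X (the fragment data moved per heal). The function and key names are hypothetical:

```python
def heal_traffic(data_fragments: int = 16, healthy_servers: int = 19) -> dict:
    """Per-server traffic, in units of X, for simultaneous heals."""
    # Every healthy SHD needs `data_fragments` fragments for its heal.
    total_reads = healthy_servers * data_fragments           # 19 * 16 = 304
    # On average 1 in `healthy_servers` of those reads is local.
    local_reads = total_reads / healthy_servers              # 304 / 19 = 16
    network_reads = total_reads - local_reads                # 288
    per_server = network_reads / healthy_servers             # ~15.2
    return {
        "healthy_receives": per_server,        # fragments fetched from peers
        "healthy_sends": per_server + 1,       # answers to peers + 1 write
        "damaged_receives": float(healthy_servers),  # one write per heal
    }


for name, value in heal_traffic().items():
    print(f"{name}: {value:.1f} * X")
```

For the 16+4 volume in this thread the sketch gives roughly 15.2, 16.2 and 19 times X, matching the figures listed above.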

>

> Is it possible that this matches your observations ?

>

> There's one more thing to consider here, and it's the apparent low

> throughput of self-heal. One possible thing to check is the small size and

> random behavior of the requests.

>

> Assuming that each request has a size of ~128 / 16 = 8KB, at a rate of ~8

> MB/s the servers are processing ~1000 IOPS. Since requests are going to 19

> different files, even if each file is accessed sequentially, the real effect

> will be like random access (some read-ahead on the filesystem can improve

> reads a bit, but writes won't benefit so much).
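
The IOPS figure can be checked with a back-of-the-envelope sketch. It assumes a 128KB heal block split over 16 data fragments and one I/O per fragment-sized request; the function names are hypothetical:

```python
def fragment_request_kb(block_kb: float = 128, data_fragments: int = 16) -> float:
    """Size of one per-brick request when a block is split into fragments."""
    return block_kb / data_fragments


def implied_iops(throughput_mb_s: float, request_kb: float) -> float:
    """Requests per second needed to sustain the given throughput."""
    return throughput_mb_s * 1024 / request_kb


request_kb = fragment_request_kb()          # 128 / 16 = 8 KB
print(request_kb)
print(implied_iops(8, request_kb))          # about 1000 requests per second
```

At 8 MB/s with 8KB requests this comes to about 1000 requests per second, which is the order of magnitude used in the reasoning above.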

>

> How many IOPS can your bricks handle?

>

> Do you have a test environment where we could check all this? If possible,

> it would be interesting to have only a single SHD (kill all SHD from all

> servers but one). In this situation, without client accesses, we should see

> the 16/1 ratio of reads vs writes on the network. We should also see a

> similar or even a little better speed because all reads and writes will be

> sequential, optimizing available IOPS.

>

> There's a feature that allows configuring the self-heal block size to optimize

> these cases. The feature is available on 3.11.

>

> Best regards,

>

> Xavi

>

>

>>

>> On Tue, May 30, 2017 at 10:25 AM, Ashish Pandey <aspandey@xxxxxxxxxx>

>> wrote:

>>>

>>>

>>> When we say client side heal or server side heal, we are basically talking

>>> about the side which "triggers" heal of a file.

>>>

>>> 1 - server side heal - shd scans indices and triggers heal

>>>

>>> 2 - client side heal - a fop finds that a file needs heal and triggers

>>> heal for that file.

>>>

>>> Now, what happens when heal gets triggered?

>>> In both cases the following functions take part:

>>>

>>> ec_heal => ec_heal_throttle => ec_launch_heal

>>>

>>> Now ec_launch_heal just creates heal tasks (with ec_synctask_heal_wrap,

>>> which calls ec_heal_do) and puts them into a queue.

>>> This happens on the server, and the "syncenv" infrastructure, which is

>>> nothing but a set of workers, picks up these tasks and executes them.

>>> That is when the actual read/write for heal happens.
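
The queue-and-workers pattern described above can be illustrated with a toy sketch. This is not GlusterFS code: the Python functions only stand in for the roles of ec_launch_heal (enqueue a task) and the syncenv workers (pick up and run tasks), and the "heal" itself is a placeholder:

```python
import queue
import threading


def run_heals(files, workers=2):
    """Queue one heal task per file and let a worker pool process them."""
    tasks = queue.Queue()
    healed = []
    lock = threading.Lock()

    def launch_heal(path):
        # Stand-in for ec_launch_heal: it only creates and queues the task.
        tasks.put(path)

    def worker():
        # Stand-in for a syncenv worker: it picks up tasks and runs them.
        while True:
            path = tasks.get()
            if path is None:            # sentinel: no more work
                break
            # The real heal (read fragments, decode, write) would go here.
            with lock:
                healed.append(path)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for f in files:
        launch_heal(f)
    for _ in threads:
        tasks.put(None)                 # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(healed)


print(run_heals(["file-a", "file-b", "file-c"]))
```

The point of the sketch is only the split of responsibilities: triggering a heal is cheap because it just queues work, while the actual I/O happens later in whichever worker picks the task up.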

>>>

>>>

>>> ________________________________

>>> From: "Serkan Çoban" <cobanserkan@xxxxxxxxx>

>>> To: "Ashish Pandey" <aspandey@xxxxxxxxxx>

>>> Cc: "Gluster Users" <gluster-users@xxxxxxxxxxx>

>>> Sent: Monday, May 29, 2017 6:44:50 PM

>>> Subject: Re:  Heal operation detail of EC volumes

>>>

>>>

>>>>> Healing could be triggered by client side (access of file) or server

>>>>> side (shd).

>>>>> However, in both the cases actual heal starts from "ec_heal_do"

>>>>> function.

>>>

>>> If I do a recursive getfattr operation from clients, then all heal

>>> operations are done on the clients, right? The clients read the chunks,

>>> then calculate and write the missing chunk.

>>> And if I don't access files from a client, then the SHD daemons will start

>>> the heal and read, calculate and write the missing chunks, right?

>>>

>>> In the first case EC calculations take place in the client fuse process; in

>>> the second case EC calculations will be made in the SHD process, right?

>>> Does the brick process have any role in EC calculations?

>>>

>>> On Mon, May 29, 2017 at 3:32 PM, Ashish Pandey <aspandey@xxxxxxxxxx>

>>> wrote:

>>>>

>>>>

>>>>

>>>> ________________________________

>>>> From: "Serkan Çoban" <cobanserkan@xxxxxxxxx>

>>>> To: "Gluster Users" <gluster-users@xxxxxxxxxxx>

>>>> Sent: Monday, May 29, 2017 5:13:06 PM

>>>> Subject:  Heal operation detail of EC volumes

>>>>

>>>> Hi,

>>>>

>>>> When a brick fails in EC, what is the healing read/write data path?

>>>> Which processes do the operations?

>>>>

>>>> Healing could be triggered by client side (access of file) or server

>>>> side (shd).

>>>> However, in both the cases actual heal starts from "ec_heal_do"

>>>> function.

>>>>

>>>>

>>>> Assume a 2GB file is being healed in 16+4 EC configuration. I was

>>>> thinking that the SHD daemon on the failed brick host will read 2GB from

>>>> the network, reconstruct its 100MB chunk and write it to the brick. Is

>>>> this right?

>>>>

>>>> You are correct about read/write.

>>>> The only point is that the SHD daemon on one of the good bricks will pick

>>>> the index entry and heal it.

>>>> The SHD daemon scans the .glusterfs/indices directory and heals the entries.

>>>> If the brick went down while IO was going on, the index will be present on

>>>> the killed brick also.

>>>> However, if a brick was down and then you started writing to a file, then

>>>> in this case the index entry would not be present on the killed brick.

>>>> So even after the brick comes back UP, shd on that brick will not be able

>>>> to find this index entry. However, the other bricks would have entries and

>>>> shd on those bricks will heal it.

>>>>

>>>> Note: I am assuming each brick is on a different node.

>>>>

>>>> Ashish

>>>>


>>>> _______________________________________________

>>>> Gluster-users mailing list

>>>> Gluster-users@xxxxxxxxxxx

>>>> http://lists.gluster.org/mailman/listinfo/gluster-users

>>>>


-- 
Pranith
