Re: GlusterFS healing questions

Xavi Hernandez <jahernan@xxxxxxxxxx> · Thu, 16 Nov 2017 08:18:17 +0100

Hi,

On Thu, Nov 9, 2017 at 7:47 PM, <ingard@xxxxxxxx> wrote:
> Someone on the #gluster-users IRC channel said the following:
>
> "Decreasing features.locks-revocation-max-blocked to an absurdly low number is letting our distributed-disperse set heal again."
>
> Is this something to consider? Does anyone else have experience with tweaking this to speed up healing?

What that option does is release the currently granted lock on a file when a new lock request arrives and more than features.locks-revocation-max-blocked lock requests are already pending. The real effect is that all pending requests can proceed, but the client that held the old granted lock keeps working without knowing that it no longer has the lock, so anything it does can cause corruption. The option was created to prevent a single bad client from blocking the entire cluster, but if you set it to a value that is very small compared to your workload, it will probably cause unwanted effects even on perfectly healthy and well-behaved clients.

This option shouldn't be used unless you are very sure of what your users are doing with the volume and how, and you understand the implications of this option. Otherwise it is a likely source of data corruption.

Anyway, if setting this option improves speed, it means that there is heavy lock usage. It should be determined whether that usage is normal or not. For example, disperse needs to take locks for reads and writes. If a file is accessed simultaneously by multiple clients, lock usage will be high because of contention between clients, but that is normal. Forcing the release of some locks while another client (for example self-heal) is trying to write will probably cause read errors on other clients.

If a real problem is detected, it's better to file a bug with as much information as you can give, either to resolve the problem (if there is one) or to improve performance (if everything works fine but slowly).
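
To make the risk concrete, here is a toy Python model of a single file's lock queue. It is only a sketch of the behaviour described above, not GlusterFS code: the class ToyLockTable, its fields and the client names are invented for illustration, and it hands the lock to just one waiter at a time.

```python
from collections import deque


class ToyLockTable:
    """Toy model of one file's lock queue (hypothetical, not GlusterFS code)."""

    def __init__(self, max_blocked):
        # Stands in for features.locks-revocation-max-blocked.
        self.max_blocked = max_blocked
        self.holder = None      # client that currently holds the granted lock
        self.pending = deque()  # clients waiting for the lock
        self.revoked = []       # clients whose lock was silently taken away

    def request(self, client):
        if self.holder is None:
            self.holder = client
            return
        # A new request arrives while more than max_blocked requests are
        # already pending: the granted lock is revoked and the oldest
        # waiter becomes the holder. The old holder is not notified.
        if len(self.pending) > self.max_blocked:
            self.revoked.append(self.holder)
            self.holder = self.pending.popleft()
        self.pending.append(client)


table = ToyLockTable(max_blocked=1)
for name in ["slow-writer", "c1", "c2", "c3"]:
    table.request(name)

print("holder: ", table.holder)
print("revoked:", table.revoked)
print("pending:", list(table.pending))
```

In this run "slow-writer" ends up in the revoked list while it may still be writing, which is exactly the situation that can corrupt data. The lower max_blocked is relative to normal contention, the sooner that happens to well-behaved clients too.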

Xavi

> On 9 Nov 2017, at 18:00, Serkan Çoban <cobanserkan@xxxxxxxxx> wrote:
>
> Hi,
>
> You can set disperse.shd-max-threads to 2 or 4 in order to make heal
> faster. This makes my heal times 2-3x faster.
> You can also play with disperse.self-heal-window-size to read more
> bytes at one time, but I did not test it.
>

>> On Thu, Nov 9, 2017 at 4:47 PM, Xavi Hernandez <jahernan@xxxxxxxxxx> wrote:
>> Hi Rolf,
>>
>> answers follow inline...
>>
>>> On Thu, Nov 9, 2017 at 3:20 PM, Rolf Larsen <rolf@xxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> We ran a test on GlusterFS 3.12.1 with erasure-coded 8+2 volumes with 10
>>> bricks (default config, tested with 100 GB, 200 GB and 400 GB brick sizes,
>>> 10 Gbit NICs).
>>>
>>> 1.
>>> Tests show that healing takes about twice as long with 200 GB bricks as
>>> with 100 GB bricks, and a bit under twice as long with 400 GB as with
>>> 200 GB. Is this expected behaviour? Extrapolating from this, 6.4 TB
>>> bricks would take ~377 hours to heal.
>>>
>>> 100 GB brick heal: 18 hours (8+2)
>>> 200 GB brick heal: 37 hours (8+2), 205% of the 100 GB time
>>> 400 GB brick heal: 59 hours (8+2), 159% of the 200 GB time
>>>
>>> The 100 GB setup is filled with 80000 x 10 MB files (the 200 GB setup
>>> has 2x and the 400 GB setup 4x as many)
>>
>> If I understand correctly, you are storing 80,000 files of 10 MB each
>> when you are using 100 GB bricks, but you double this number for 200 GB
>> bricks (160,000 files of 10 MB each), and for 400 GB bricks you create
>> 320,000 files. Have I understood it correctly?
>>
>> If so, it's normal that twice the data requires approximately twice the
>> heal time. The healing time depends on the contents of the brick, not on
>> the brick size. The same number of files should take the same healing
>> time, whatever the brick size is.
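
The arithmetic behind these figures can be checked with a few lines of Python. The heal times are the ones reported above; the two helper functions are illustrative names, and the per-doubling factor of 1.59 is only a guess at how an estimate close to 377 hours could have been reached, not something stated in the thread.

```python
import math

# Heal times reported in the thread, in hours, keyed by brick size in GB.
heal_hours = {100: 18, 200: 37, 400: 59}


def linear_estimate(target_gb, ref_gb, ref_hours):
    """Scale a measured heal time linearly with the amount of data."""
    return ref_hours * target_gb / ref_gb


def per_doubling_estimate(target_gb, ref_gb, ref_hours, factor):
    """Multiply the heal time by `factor` for every doubling of the data."""
    doublings = math.log2(target_gb / ref_gb)
    return ref_hours * factor ** doublings


# Ratios between consecutive measurements (the 205% and 159% figures).
ratio_200 = heal_hours[200] / heal_hours[100]
ratio_400 = heal_hours[400] / heal_hours[200]
print(round(ratio_200, 2), round(ratio_400, 2))

# Two ways to extrapolate from the 400 GB measurement to 6.4 TB (6400 GB).
linear_6400 = linear_estimate(6400, 400, heal_hours[400])
doubling_6400 = per_doubling_estimate(6400, 400, heal_hours[400], 1.59)
print(linear_6400, round(doubling_6400))
```

If heal time really scales with the amount of data, as described above, the linear figure is the more cautious one to plan with. Either way, the estimate only holds for bricks filled with the same kind of files.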

>>> 2.
>>> Is there any way to show the progress of a heal? At the moment we run
>>> 'gluster volume heal <volname> info', but this exits when a brick is done
>>> healing, and when we run heal info again the command continues showing
>>> gfids until the brick is done again. This gives quite a poor picture of
>>> the status of a heal.
>>
>> The output of 'gluster volume heal <volname> info' shows the list of files
>> pending heal on each brick. The heal is complete when the list is
>> empty. A faster alternative, if you don't want to see the whole list of
>> files, is 'gluster volume heal <volname> statistics heal-count'. This
>> only shows the number of pending files on each brick.
>>
>> I don't know of any other way to track the progress of self-heal.
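
For rough progress tracking, the heal-count output can be summed with a small script. The sketch below assumes the command prints a "Brick <host>:<path>" line followed by a "Number of entries: <n>" line for each brick; check that against the output of your version. The function name parse_heal_count and the sample text (including its counts) are made up for illustration.

```python
import re


def parse_heal_count(output):
    """Return {brick: pending_entries} from heal-count style text."""
    counts = {}
    brick = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
            continue
        match = re.match(r"Number of entries:\s*(\d+)$", line)
        if match and brick is not None:
            counts[brick] = int(match.group(1))
            brick = None
    return counts


# Hypothetical sample text in the assumed format; the numbers are invented.
sample = """\
Gathering count of entries to be healed on volume test-ec-100g has been successful

Brick dn-304:/mnt/test-ec-100/brick
Number of entries: 0

Brick dn-305:/mnt/test-ec-100/brick
Number of entries: 1250
"""

pending = parse_heal_count(sample)
total_pending = sum(pending.values())
print(pending)
print("total pending:", total_pending)
```

Running this periodically against the real command output and logging the total gives a pending-entries curve over time, which is about as close to a progress indicator as the heal commands allow.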

>>> 3.
>>> What kind of config tweaks are recommended for this kind of EC volume?
>>
>> I usually use the following values (specific only for ec):
>>
>> client.event-threads 4
>> server.event-threads 4
>> performance.client-io-threads on
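
These values are applied with 'gluster volume set <volname> <option> <value>'. As a small sketch, the Python below only builds and prints the three commands so they can be reviewed before being run; the helper name volume_set_commands is invented, and the volume name is taken from the volume info in this thread.

```python
import shlex

# Option values quoted above.
EC_OPTIONS = {
    "client.event-threads": "4",
    "server.event-threads": "4",
    "performance.client-io-threads": "on",
}


def volume_set_commands(volname, options):
    """Build one 'gluster volume set' argument list per option."""
    return [
        ["gluster", "volume", "set", volname, key, value]
        for key, value in options.items()
    ]


commands = volume_set_commands("test-ec-100g", EC_OPTIONS)
for cmd in commands:
    # Print only; pass each list to subprocess.run() to actually apply it.
    print(shlex.join(cmd))
```

Keeping the options in one dictionary makes it easy to apply the same set to several volumes or to add options later.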

>>
>> Regards,
>>
>> Xavi
>>

>>> $ gluster volume info
>>> Volume Name: test-ec-100g
>>> Type: Disperse
>>> Volume ID: 0254281d-2f6e-4ac4-a773-2b8e0eb8ab27
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (8 + 2) = 10
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: dn-304:/mnt/test-ec-100/brick
>>> Brick2: dn-305:/mnt/test-ec-100/brick
>>> Brick3: dn-306:/mnt/test-ec-100/brick
>>> Brick4: dn-307:/mnt/test-ec-100/brick
>>> Brick5: dn-308:/mnt/test-ec-100/brick
>>> Brick6: dn-309:/mnt/test-ec-100/brick
>>> Brick7: dn-310:/mnt/test-ec-100/brick
>>> Brick8: dn-311:/mnt/test-ec-2/brick
>>> Brick9: dn-312:/mnt/test-ec-100/brick
>>> Brick10: dn-313:/mnt/test-ec-100/brick
>>> Options Reconfigured:
>>> nfs.disable: on
>>> transport.address-family: inet
>>>

>>> Volume Name: test-ec-200
>>> Type: Disperse
>>> Volume ID: 2ce23e32-7086-49c5-bf0c-7612fd7b3d5d
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (8 + 2) = 10
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: dn-304:/mnt/test-ec-200/brick
>>> Brick2: dn-305:/mnt/test-ec-200/brick
>>> Brick3: dn-306:/mnt/test-ec-200/brick
>>> Brick4: dn-307:/mnt/test-ec-200/brick
>>> Brick5: dn-308:/mnt/test-ec-200/brick
>>> Brick6: dn-309:/mnt/test-ec-200/brick
>>> Brick7: dn-310:/mnt/test-ec-200/brick
>>> Brick8: dn-311:/mnt/test-ec-200_2/brick
>>> Brick9: dn-312:/mnt/test-ec-200/brick
>>> Brick10: dn-313:/mnt/test-ec-200/brick
>>> Options Reconfigured:
>>> nfs.disable: on
>>> transport.address-family: inet
>>>

>>> Volume Name: test-ec-400
>>> Type: Disperse
>>> Volume ID: fe00713a-7099-404d-ba52-46c6b4b6ecc0
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (8 + 2) = 10
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: dn-304:/mnt/test-ec-400/brick
>>> Brick2: dn-305:/mnt/test-ec-400/brick
>>> Brick3: dn-306:/mnt/test-ec-400/brick
>>> Brick4: dn-307:/mnt/test-ec-400/brick
>>> Brick5: dn-308:/mnt/test-ec-400/brick
>>> Brick6: dn-309:/mnt/test-ec-400/brick
>>> Brick7: dn-310:/mnt/test-ec-400/brick
>>> Brick8: dn-311:/mnt/test-ec-400_2/brick
>>> Brick9: dn-312:/mnt/test-ec-400/brick
>>> Brick10: dn-313:/mnt/test-ec-400/brick
>>> Options Reconfigured:
>>> nfs.disable: on
>>> transport.address-family: inet
>>>

>>> --
>>>
>>> Regards
>>> Rolf Arne Larsen
>>> Ops Engineer
>>> rolf@xxxxxxxxxxxxxx
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users@xxxxxxxxxxx
>>> http://lists.gluster.org/mailman/listinfo/gluster-users
