Re: Freezing during heal

Lindsay Mathieson <lindsay.mathieson@xxxxxxxxx> · Mon, 25 Apr 2016 22:50:53 +1000

    Good luck!

      On 25/04/2016 10:01 PM, Kevin Lemonnier wrote:

      Hi,

So I'm trying that now.
I installed 3.7.11 on two nodes and put a few VMs on them, with the same config
as before but with 64MB shards and the heal algorithm set to full. As expected,
if I power off one of the nodes, everything is dead, which is fine.

Now I'm adding a third node. The add-brick started a big heal of everything
(7000+ shards), and so far everything seems to be working fine on the VMs.
Last time I tried adding a brick, all the VMs died for the duration of the
heal, so that's already pretty good.

I'm going to let it finish copying everything to the new node, then I'll try
to simulate nodes going down to see if my original problem of freezing and
slow heals is solved with this config.
For reference, here is the volume info, in case someone sees something I should change:

Volume Name: gluster
Type: Replicate
Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
performance.readdir-ahead: on

The node numbering starts at 2 and jumps to 50 because the first server is doing something else for now,
and I'm using 50 as the temporary third node. If everything goes well, I'll migrate production
onto the cluster, re-install the first server and do a replace-brick, which I hope will work just as well
as the add-brick I'm doing now. The last replace-brick also brought everything down, but I guess that was the
joy of 3.7.6 :).

Thanks!

On Mon, Apr 18, 2016 at 08:17:05PM +0530, Krutika Dhananjay wrote:

        On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <lemonnierk@xxxxxxxxx> wrote:

          I will try migrating to 3.7.10, is it considered stable yet ?

        Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)

          Should I change the self-heal algorithm even if I move to 3.7.10, or is
that not necessary? I'm not sure what that change might do.

        So the other algorithm is 'diff', which computes a rolling checksum on chunks
of the src(es) and sink(s), compares them and heals upon mismatch. This is
known to consume a lot of CPU. The 'full' algo, on the other hand, simply copies
the src into the sink in chunks. With sharding, it shouldn't be all that bad
copying a 256MB file (in your case) from src to sink. We've used double the
block size and had no issues reported.

So you could change the self-heal algo to full even in the upgraded cluster.
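
As a rough sketch of the difference described above (not Gluster's actual
implementation), the Python below heals an in-memory sink from a source with
both strategies. The chunk size, the function names, and the use of a plain
per-chunk SHA-256 in place of a rolling checksum are all assumptions made
for illustration.

```python
import hashlib

CHUNK = 4  # illustrative chunk size in bytes (assumption, far smaller than real)


def heal_full(src: bytes, chunk: int = CHUNK) -> tuple[bytes, int]:
    """'full' style: copy every chunk of src into the sink, no comparison."""
    healed = bytearray()
    writes = 0
    for off in range(0, len(src), chunk):
        healed += src[off:off + chunk]
        writes += 1
    return bytes(healed), writes


def heal_diff(src: bytes, sink: bytes, chunk: int = CHUNK) -> tuple[bytes, int]:
    """'diff' style: checksum each chunk on both sides, copy only mismatches."""
    # Bring the sink to the source's length first (truncate or zero-pad).
    healed = bytearray(sink[:len(src)].ljust(len(src), b"\0"))
    writes = 0
    for off in range(0, len(src), chunk):
        want = src[off:off + chunk]
        have = bytes(healed[off:off + chunk])
        # Checksumming both sides is where the extra CPU cost comes from.
        if hashlib.sha256(want).digest() != hashlib.sha256(have).digest():
            healed[off:off + chunk] = want
            writes += 1
    return bytes(healed), writes


src = b"AAAABBBBCCCCDDDD"
sink = b"AAAAxxxxCCCCDDDD"  # one stale chunk out of four
print(heal_full(src)[1])        # 4 chunks written by 'full'
print(heal_diff(src, sink)[1])  # 1 chunk written by 'diff'
```

In this toy model 'diff' writes less but hashes every chunk on both sides,
while 'full' does no comparison work and just copies.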

-Krutika

          Anyway, I'll try to create a 3.7.10 cluster over the weekend and slowly
move the VMs onto it then.
Thanks a lot for your help,

Regards

On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay wrote:

            Hi,

Yeah, so the fuse mount log didn't convey much information.

So one of the reasons the heal may have taken so long (and also consumed
resources) is a bug in self-heal where it would heal from both source
bricks in 3-way replication. With such a bug, the heal would take twice as
long and consume the same amount of resources each time.

This issue is fixed at http://review.gluster.org/#/c/14008/ and will be
available in 3.7.12.

The other thing you could do is to set cluster.data-self-heal-algorithm to
'full', for better heal performance and more regulated resource consumption
by the heal.
 # gluster volume set <VOL> cluster.data-self-heal-algorithm full

As far as sharding is concerned, some critical caching issues were fixed in
3.7.7 and 3.7.8.
And my guess is that the VM crash/unbootable state could be because of this
issue, which exists in 3.7.6.

3.7.10 saw the introduction of throttled client-side heals, which also moves
such heals to the background, which is all the more helpful for preventing
starvation of VMs during client heal.

Considering these factors, I think it would be better if you upgraded your
machines to 3.7.10.

Do let me know if migrating to 3.7.10 solves your issues.

-Krutika

On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier <lemonnierk@xxxxxxxxx> wrote:

              Yes, but as I was saying, I don't believe KVM is using a mount point; I
think it uses the API
(http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt).
I might be mistaken, of course. Proxmox does have a mount point for
convenience, so I'll attach those logs, hoping they contain the information
you need. They do seem to contain a lot of errors for the 15th.
For reference, there was a disconnect of the first brick (10.10.0.1) in the
morning and then a successful heal that caused about 40 minutes of downtime
for the VMs. Right after that heal finished (if my memory is correct it was
about noon or close), the second brick (10.10.0.2) rebooted, and that's the
one I disconnected to prevent the heal from causing another downtime.
I reconnected it once at the end of the afternoon, hoping the heal would go
well, but everything went down like in the morning, so I disconnected it
again and waited until 11pm (23:00) to reconnect it and let it finish.

Thanks for your help,

On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:

                Sorry, I was referring to the glusterfs client logs.

Assuming you are using a FUSE mount, your log file will be in
/var/log/glusterfs/<hyphenated-mount-point-path>.log
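
A minimal sketch of that naming rule, assuming "hyphenated" simply means the
mount path with its leading slash dropped and the remaining slashes turned
into hyphens. The helper name and the example mount point are hypothetical.

```python
def fuse_client_log(mount_point: str, log_dir: str = "/var/log/glusterfs") -> str:
    """Build the client log path from a FUSE mount point by hyphenating it."""
    hyphenated = mount_point.strip("/").replace("/", "-")
    return f"{log_dir}/{hyphenated}.log"


# /mnt/pve/gluster is a made-up mount point for illustration only.
print(fuse_client_log("/mnt/pve/gluster"))  # /var/log/glusterfs/mnt-pve-gluster.log
```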

-Krutika

On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier <lemonnierk@xxxxxxxxx> wrote:

                  I believe Proxmox is just an interface to KVM that uses the lib, so if
I'm not mistaken there aren't any client logs?

It's not the first time I've had the issue; it happens on every heal on the
2 clusters I have.

I did let the heal finish that night and the VMs are working now, but it is
pretty scary for future crashes or brick replacements.
Should I maybe lower the shard size? It won't solve the fact that 2 bricks
out of 3 aren't keeping the filesystem usable, but it might make the healing
quicker, right?
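
To put rough numbers on the shard-size question, here is a back-of-the-envelope
sketch in Python. It assumes the shard sizes are binary megabytes, a
hypothetical 100 GiB disk image, and a hypothetical count of 10 shards needing
heal; none of these figures come from the cluster discussed here.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def shard_count(file_size: int, shard_size: int) -> int:
    """Number of shards needed to hold a file (integer ceiling division)."""
    return -(-file_size // shard_size)


def full_heal_bytes(dirty_shards: int, shard_size: int) -> int:
    """Bytes copied when each dirty shard is healed by copying it whole."""
    return dirty_shards * shard_size


image = 100 * GIB  # hypothetical VM disk image
for size in (256 * MIB, 64 * MIB):
    shards = shard_count(image, size)
    copied = full_heal_bytes(10, size) // MIB  # 10 dirty shards, hypothetical
    print(f"{size // MIB}MB shards: {shards} shards, {copied} MiB copied")
```

The same set of writes may touch more shards when the shards are smaller, so
the copied volume does not necessarily shrink by the full factor of four.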

Thanks

On 17 April 2016 17:56:37 GMT+02:00, Krutika Dhananjay <kdhananj@xxxxxxxxxx> wrote:

                    Could you share the client logs and the approximate time and day
when you saw this issue?

-Krutika

On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier <lemonnierk@xxxxxxxxx> wrote:

                      Hi,

We have a small GlusterFS 3.7.6 cluster with 3 nodes, running Proxmox VMs on
it. I did set up the various recommended options, like the virt group, but by
hand since it's on Debian. The shards are 256MB, if that matters.

This morning the second node crashed, and as it came back up it started a
heal, but that basically froze all the VMs running on that volume. Since we
really, really can't have 40 minutes of downtime in the middle of the day, I
just removed the node from the network and that stopped the heal, allowing
the VMs to access their disks again. The plan was to reconnect the node in a
couple of hours to let it heal at night.
But a VM has crashed now, and it can't boot up again: it seems to freeze
trying to access the disks.

Looking at the heal info for the volume, it has gone way up since this
morning; it looks like the VMs aren't writing to both nodes, just the one
they are on.
It seems pretty bad: we have 2 nodes out of 3 up, so I would expect the
volume to work just fine since it has quorum. What am I missing?

It is still too early to start the heal. Is there a way to start the VM
anyway right now? I mean, it was running a moment ago so the data is there,
it just needs to let the VM access it.

Volume Name: vm-storage
Type: Replicate
Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: first_node:/mnt/vg1-storage
Brick2: second_node:/mnt/vg1-storage
Brick3: third_node:/mnt/vg1-storage
Options Reconfigured:
cluster.quorum-type: auto
cluster.server-quorum-type: server
network.remote-dio: enable
cluster.eager-lock: enable
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
features.shard: on
features.shard-block-size: 256MB
cluster.server-quorum-ratio: 51%
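
For reference, the quorum expectation stated above can be written down as a
small Python sketch. It models only the client-side rule as I understand the
documented behaviour of cluster.quorum-type: auto (more than half of the
replica bricks up, or exactly half if the first brick is among them). The
function name is hypothetical, server quorum is not modelled, and this says
nothing about why I/O froze during the heal.

```python
def client_quorum_auto(bricks_up: list[bool]) -> bool:
    """Return True if writes would be allowed under the 'auto' rule.

    bricks_up[i] is True when brick i of the replica set is reachable;
    index 0 is the first brick listed in the volume info.
    """
    total = len(bricks_up)
    up = sum(bricks_up)
    if 2 * up > total:       # strict majority of bricks is up
        return True
    if 2 * up == total:      # exactly half: the first brick breaks the tie
        return bricks_up[0]
    return False


print(client_quorum_auto([True, False, True]))  # True: 2 of 3 bricks up
print(client_quorum_auto([False, True]))        # False: 1 of 2, first brick down
print(client_quorum_auto([True, False]))        # True: 1 of 2, first brick up
```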

Thanks for your help

--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users

    -- 
Lindsay Mathieson
