Re: Freezing during heal

Kevin Lemonnier <lemonnierk@xxxxxxxxx> · Mon, 2 May 2016 11:57:19 +0200

Like last time, I'm using Proxmox, so I don't have a client log: it uses the library (libgfapi).
I'm attaching the shd log from the first node; do you need the other two as well?
I tar.gz'ed it, hope that's okay. In case it's not clear from the logs, I removed the brick
on ipvr50 then added it again (after rm -Rf /mnt/storage/gluster on it, of course).
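
For readers following along, the remove-then-add cycle described above maps onto two gluster CLI invocations. The sketch below only builds the command lines and does not run anything. The volume name and brick path are taken from the volume info quoted further down in this thread; the helper names are hypothetical, and the use of `force` on remove-brick assumes the replica count is being reduced from 3 to 2.

```python
def remove_brick_cmd(volume, brick, new_replica):
    """Command line to drop one brick and shrink the replica count."""
    return ["gluster", "volume", "remove-brick", volume,
            "replica", str(new_replica), brick, "force"]


def add_brick_cmd(volume, brick, new_replica):
    """Command line to add one brick and grow the replica count."""
    return ["gluster", "volume", "add-brick", volume,
            "replica", str(new_replica), brick]


if __name__ == "__main__":
    brick = "ipvr50.client_name:/mnt/storage/gluster"
    # Print the commands instead of executing them.
    print(" ".join(remove_brick_cmd("gluster", brick, 2)))
    print(" ".join(add_brick_cmd("gluster", brick, 3)))
```

Passing the lists to `subprocess.run` would execute them, but that is deliberately left out here.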

thanks

On Mon, May 02, 2016 at 02:49:35PM +0530, Krutika Dhananjay wrote:
> Could you attach the glusterfs client, shd logs?
> 
> -Krutika
> 
> On Mon, May 2, 2016 at 2:35 PM, Kevin Lemonnier <lemonnierk@xxxxxxxxx>
> wrote:
> 
> > Hi,
> >
> > So after some testing, it is a lot better, but I still have some
> > problems with 3.7.11.
> > When I reboot a server it sometimes behaves strangely, but I need to
> > test that better.
> > Removing a server from the network, waiting for a while, then adding it
> > back and letting it heal works perfectly: it is completely invisible to
> > the user, and that's perfect!
> >
> > However, when I add a brick, changing the replica count from 2 to 3, it
> > starts a heal and some VMs switch to read-only. I have to power them off
> > then on again to fix it. That's clearly better than with 3.7.6, which
> > froze the VMs until the heal was complete, but I would still like to
> > understand why some of the VMs are switching to read-only.
> > It looks like it happens every time I add a brick to increase the replica
> > count. I would like to test adding a whole replica set at once, but I
> > just don't have the hardware for that.
> >
> > Rebooting a node looks like it's making some VMs go read-only too, but I
> > need to test that better. For some reason, rebooting a brick or adding a
> > brick seems to cause I/O errors on some VM disks and not others, and I
> > have to power them off and then on to fix it.
> > I can't just reboot them; I guess I have to actually re-open the file to
> > trigger a heal?
> >
> > Any idea how to prevent that? It's a lot better than 3.7.6 because it
> > can be fixed in a minute, but that's still not great to explain to the
> > clients.
> >
> > Thanks
> >
> >
> > On Mon, Apr 25, 2016 at 02:01:09PM +0200, Kevin Lemonnier wrote:
> > > Hi,
> > >
> > > So I'm trying that now.
> > > I installed 3.7.11 on two nodes and put a few VMs on them, same config
> > > as before but with 64MB shards and the heal algorithm set to full. As
> > > expected, if I power off one of the nodes, everything is dead, which
> > > is fine.
> > >
> > > Now I'm adding a third node. The add-brick started a big heal of
> > > everything (7000+ shards), and for now everything seems to be working
> > > fine on the VMs. Last time I tried adding a brick, all the VMs died for
> > > the duration of the heal, so that's already pretty good.
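
As a rough sense of scale for the heal mentioned above: if each of the 7000 shards were a full 64MB block, the new brick would have to receive about 437.5 GiB. The sketch below is only that arithmetic. It assumes "MB" means 2^20 bytes and treats the result as an upper bound, since the last shard of each image (and any sparse shard) can be smaller than the block size. The function name is hypothetical.

```python
def max_heal_gib(shard_count: int, shard_block_mib: int) -> float:
    """Upper bound, in GiB, on the data a full heal copies to one new brick."""
    return shard_count * shard_block_mib / 1024


if __name__ == "__main__":
    print(max_heal_gib(7000, 64))  # 437.5
```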
> > >
> > > I'm going to let it finish copying everything to the new node, then
> > > I'll try to simulate nodes going down to see if my original problem of
> > > freezing and slow heals is solved with this config.
> > > For reference, here is the volume info, in case someone sees something
> > > I should change:
> > >
> > > Volume Name: gluster
> > > Type: Replicate
> > > Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > > Status: Started
> > > Number of Bricks: 1 x 3 = 3
> > > Transport-type: tcp
> > > Bricks:
> > > Brick1: ipvr2.client_name:/mnt/storage/gluster
> > > Brick2: ipvr3.client_name:/mnt/storage/gluster
> > > Brick3: ipvr50.client_name:/mnt/storage/gluster
> > > Options Reconfigured:
> > > cluster.quorum-type: auto
> > > cluster.server-quorum-type: server
> > > network.remote-dio: enable
> > > cluster.eager-lock: enable
> > > performance.quick-read: off
> > > performance.read-ahead: off
> > > performance.io-cache: off
> > > performance.stat-prefetch: off
> > > features.shard: on
> > > features.shard-block-size: 64MB
> > > cluster.data-self-heal-algorithm: full
> > > performance.readdir-ahead: on
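
When comparing option sets like the one above against another volume, it can help to turn the `gluster volume info` text into a dictionary. The following is a minimal sketch that assumes the plain "Key: value" layout shown in this thread, with a "Bricks:" section followed by an "Options Reconfigured:" section. The function name and the shortened sample are illustrative only.

```python
SAMPLE = """\
Volume Name: gluster
Type: Replicate
Number of Bricks: 1 x 3 = 3
Bricks:
Brick1: ipvr2.client_name:/mnt/storage/gluster
Brick2: ipvr3.client_name:/mnt/storage/gluster
Brick3: ipvr50.client_name:/mnt/storage/gluster
Options Reconfigured:
cluster.quorum-type: auto
features.shard: on
features.shard-block-size: 64MB
cluster.data-self-heal-algorithm: full
"""


def parse_volume_info(text: str) -> dict:
    """Parse 'Key: value' lines into top-level fields, bricks and options."""
    info = {"bricks": [], "options": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only: brick values contain colons too.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if line == "Bricks:":
            section = "bricks"
        elif line == "Options Reconfigured:":
            section = "options"
        elif section == "bricks":
            info["bricks"].append(value)
        elif section == "options":
            info["options"][key] = value
        else:
            info[key] = value
    return info


if __name__ == "__main__":
    parsed = parse_volume_info(SAMPLE)
    print(parsed["bricks"])
    print(parsed["options"])
```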
> > >
> > >
> > > The numbering starts at 2 and jumps to 50 because the first server is
> > > doing something else for now, and I use 50 as the temporary third node.
> > > If everything goes well, I'll migrate production onto the cluster,
> > > re-install the first server and do a replace-brick, which I hope will
> > > work just as well as the add-brick I'm doing now. The last replace-brick
> > > also brought everything down, but I guess that was the joy of 3.7.6 :).
> > >
> > > Thanks !
> > >
> > >
> > > On Mon, Apr 18, 2016 at 08:17:05PM +0530, Krutika Dhananjay wrote:
> > > > On Mon, Apr 18, 2016 at 8:02 PM, Kevin Lemonnier <lemonnierk@xxxxxxxxx>
> > > > wrote:
> > > >
> > > > > I will try migrating to 3.7.10, is it considered stable yet ?
> > > > >
> > > >
> > > > Oops, just realized 3.7.10 had a regression. Then 3.7.11 it is. :)
> > > >
> > > >
> > > > > Should I change the self-heal algorithm even if I move to 3.7.10, or
> > > > > is that not necessary?
> > > > > Not sure what that change might do.
> > > > >
> > > >
> > > > So the other algorithm is 'diff', which computes rolling checksums on
> > > > chunks of the source(s) and sink(s), compares them and heals upon
> > > > mismatch. This is known to consume a lot of CPU. The 'full' algorithm,
> > > > on the other hand, simply copies the source into the sink in chunks.
> > > > With sharding, it shouldn't be all that bad copying a 256MB file (in
> > > > your case) from source to sink. We've used double the block size and
> > > > had no issues reported.
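
The difference between the two algorithms described above can be illustrated with a toy model. This is only a sketch of the idea, not GlusterFS code: it uses SHA-256 per chunk as a stand-in for the rolling checksum, works on in-memory byte strings, and uses a tiny chunk size so the behaviour is easy to follow. All names are hypothetical.

```python
import hashlib

CHUNK = 4  # deliberately tiny chunk size for illustration (assumption)


def chunks(data: bytes, size: int = CHUNK):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def heal_full(src: bytes):
    """'full'-style heal: copy every source chunk without comparing."""
    copied = chunks(src)
    return b"".join(copied), len(copied)


def heal_diff(src: bytes, sink: bytes):
    """'diff'-style heal: checksum both sides, copy only mismatched chunks."""
    src_chunks, sink_chunks = chunks(src), chunks(sink)
    healed, copied = [], 0
    for i, s in enumerate(src_chunks):
        old = sink_chunks[i] if i < len(sink_chunks) else None
        same = (old is not None and
                hashlib.sha256(old).digest() == hashlib.sha256(s).digest())
        if same:
            healed.append(old)   # checksums match: nothing to copy
        else:
            healed.append(s)     # mismatch or missing chunk: copy from source
            copied += 1
    return b"".join(healed), copied
```

In this model 'diff' pays for two checksums per chunk to save copies, while 'full' copies every chunk and computes nothing, which matches the CPU trade-off described above.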
> > > >
> > > > So you could change the self-heal algorithm to full even in the upgraded
> > > > cluster.
> > > >
> > > > -Krutika
> > > >
> > > >
> > > > >
> > > > > Anyway, I'll try to create a 3.7.10 cluster over the weekend and slowly
> > > > > move the VMs onto it then.
> > > > > Thanks a lot for your help,
> > > > >
> > > > > Regards
> > > > >
> > > > >
> > > > > On Mon, Apr 18, 2016 at 07:58:44PM +0530, Krutika Dhananjay wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Yeah, so the fuse mount log didn't convey much information.
> > > > > >
> > > > > > So one of the reasons the heal may have taken so long (and also
> > > > > > consumed resources) is a bug in self-heal where it would heal from
> > > > > > both source bricks in 3-way replication. With such a bug, the heal
> > > > > > would take twice as long and consume the same amount of resources
> > > > > > both times.
> > > > > >
> > > > > > This issue is fixed at http://review.gluster.org/#/c/14008/ and will
> > > > > > be available in 3.7.12.
> > > > > >
> > > > > > The other thing you could do is to set
> > > > > > cluster.data-self-heal-algorithm to 'full', for better heal
> > > > > > performance and more regulated resource consumption by the same.
> > > > > >  # gluster volume set <VOL> cluster.data-self-heal-algorithm full
> > > > > >
> > > > > > As far as sharding is concerned, some critical caching issues were
> > > > > > fixed in 3.7.7 and 3.7.8.
> > > > > > And my guess is that the VM crash/unbootable state could be because
> > > > > > of this issue, which exists in 3.7.6.
> > > > > >
> > > > > > 3.7.10 saw the introduction of throttled client-side heals, which
> > > > > > also moves such heals to the background; this is all the more helpful
> > > > > > for preventing starvation of VMs during client heal.
> > > > > >
> > > > > > Considering these factors, I think it would be better if you
> > > > > > upgraded your machines to 3.7.10.
> > > > > >
> > > > > > Do let me know if migrating to 3.7.10 solves your issues.
> > > > > >
> > > > > > -Krutika
> > > > > >
> > > > > > On Mon, Apr 18, 2016 at 12:40 PM, Kevin Lemonnier
> > > > > > <lemonnierk@xxxxxxxxx> wrote:
> > > > > >
> > > > > > > Yes, but as I was saying I don't believe KVM is using a mount
> > > > > > > point; I think it uses the API
> > > > > > > (http://www.gluster.org/community/documentation/index.php/Libgfapi_with_qemu_libvirt).
> > > > > > > I might be mistaken, of course. Proxmox does have a mount point for
> > > > > > > convenience; I'll attach those logs, hoping they contain the
> > > > > > > information you need. They do seem to contain a lot of errors
> > > > > > > for the 15th.
> > > > > > > For reference, there was a disconnect of the first brick
> > > > > > > (10.10.0.1) in the morning and then a successful heal that caused
> > > > > > > about 40 minutes of downtime for the VMs. Right after that heal
> > > > > > > finished (if my memory is correct it was about noon or close), the
> > > > > > > second brick (10.10.0.2) rebooted, and that's the one I disconnected
> > > > > > > to prevent the heal from causing more downtime.
> > > > > > > I reconnected it once at the end of the afternoon, hoping the heal
> > > > > > > would go well, but everything went down like in the morning, so I
> > > > > > > disconnected it again and waited until 11pm (23:00) to reconnect it
> > > > > > > and let it finish.
> > > > > > >
> > > > > > > Thanks for your help,
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Apr 18, 2016 at 12:28:28PM +0530, Krutika Dhananjay wrote:
> > > > > > > > Sorry, I was referring to the glusterfs client logs.
> > > > > > > >
> > > > > > > > Assuming you are using a FUSE mount, your log file will be in
> > > > > > > > /var/log/glusterfs/<hyphenated-mount-point-path>.log
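
The "hyphenated mount point path" naming above can be sketched as a small helper. This assumes the convention is simply to drop the leading slash and replace the remaining slashes with hyphens. The function name and the example mount point `/mnt/pve/gluster` are hypothetical and not taken from this thread.

```python
def fuse_log_path(mount_point: str, log_dir: str = "/var/log/glusterfs") -> str:
    """Derive the expected FUSE client log path for a given mount point."""
    # Drop leading/trailing slashes, then turn path separators into hyphens.
    name = mount_point.strip("/").replace("/", "-")
    return f"{log_dir}/{name}.log"


if __name__ == "__main__":
    print(fuse_log_path("/mnt/pve/gluster"))
```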
> > > > > > > >
> > > > > > > > -Krutika
> > > > > > > >
> > > > > > > > On Sun, Apr 17, 2016 at 9:37 PM, Kevin Lemonnier
> > > > > > > > <lemonnierk@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > > I believe Proxmox is just an interface to KVM that uses the
> > > > > > > > > lib, so if I'm not mistaken there aren't any client logs?
> > > > > > > > >
> > > > > > > > > It's not the first time I've had the issue; it happens on every
> > > > > > > > > heal on the 2 clusters I have.
> > > > > > > > >
> > > > > > > > > I did let the heal finish that night and the VMs are working now,
> > > > > > > > > but it is pretty scary for future crashes or brick replacements.
> > > > > > > > > Should I maybe lower the shard size? It won't solve the fact that
> > > > > > > > > 2 bricks out of 3 aren't keeping the filesystem usable, but it
> > > > > > > > > might make the healing quicker, right?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > On 17 April 2016 17:56:37 GMT+02:00, Krutika Dhananjay
> > > > > > > > > <kdhananj@xxxxxxxxxx> wrote:
> > > > > > > > > >Could you share the client logs and information about the
> > > > > > > > > >approximate time/day when you saw this issue?
> > > > > > > > > >
> > > > > > > > > >-Krutika
> > > > > > > > > >
> > > > > > > > > >On Sat, Apr 16, 2016 at 12:57 AM, Kevin Lemonnier
> > > > > > > > > ><lemonnierk@xxxxxxxxx>
> > > > > > > > > >wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi,
> > > > > > > > > >>
> > > > > > > > > >> We have a small GlusterFS 3.7.6 cluster with 3 nodes running Proxmox
> > > > > > > > > >> VMs on it. I did set up the various recommended options, like the virt
> > > > > > > > > >> group, but by hand since it's on Debian. The shards are 256MB, if that
> > > > > > > > > >> matters.
> > > > > > > > > >>
> > > > > > > > > >> This morning the second node crashed, and as it came back up it started
> > > > > > > > > >> a heal, but that basically froze all the VMs running on that volume.
> > > > > > > > > >> Since we really can't have 40 minutes of downtime in the middle of the
> > > > > > > > > >> day, I just removed the node from the network, and that stopped the
> > > > > > > > > >> heal, allowing the VMs to access their disks again. The plan was to
> > > > > > > > > >> reconnect the node in a couple of hours to let it heal at night.
> > > > > > > > > >> But a VM has crashed now, and it can't boot up again: it seems to freeze
> > > > > > > > > >> trying to access the disks.
> > > > > > > > > >>
> > > > > > > > > >> Looking at the heal info for the volume, it has gone way up since this
> > > > > > > > > >> morning; it looks like the VMs aren't writing to both nodes, just the
> > > > > > > > > >> one they are on.
> > > > > > > > > >> It seems pretty bad: we have 2 nodes out of 3 up, so I would expect the
> > > > > > > > > >> volume to work just fine since it has quorum. What am I missing?
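
The quorum expectation stated above can be checked with a small model. This is a simplified sketch under stated assumptions, not the GlusterFS implementation. For `cluster.quorum-type: auto` it assumes writes are allowed when more than half of the bricks in the replica set are up, or exactly half including the first brick. For server quorum it assumes the percentage of reachable servers is compared directly with `cluster.server-quorum-ratio`. The function names are hypothetical.

```python
from typing import Sequence


def client_quorum_met(bricks_up: Sequence[bool]) -> bool:
    """Model of quorum-type=auto for one replica set (first brick breaks ties)."""
    total, up = len(bricks_up), sum(bricks_up)
    if 2 * up > total:
        return True
    return 2 * up == total and bool(bricks_up[0])


def server_quorum_met(servers_up: int, servers_total: int,
                      ratio_percent: float = 51.0) -> bool:
    """Simplified model: share of reachable servers must reach the ratio."""
    return servers_up * 100 >= ratio_percent * servers_total


if __name__ == "__main__":
    # Second node of three is down, as in the situation described above.
    print(client_quorum_met([True, False, True]))
    print(server_quorum_met(2, 3))
```

Under this model both checks pass with two of three nodes up, which is consistent with the expectation stated above that quorum alone should not block I/O.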
> > > > > > > > > >>
> > > > > > > > > >> It is still too early to start the heal; is there a way to start the VM
> > > > > > > > > >> anyway right now? I mean, it was running a moment ago so the data is
> > > > > > > > > >> there, it just needs to let the VM access it.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> Volume Name: vm-storage
> > > > > > > > > >> Type: Replicate
> > > > > > > > > >> Volume ID: a5b19324-f032-4136-aaac-5e9a4c88aaef
> > > > > > > > > >> Status: Started
> > > > > > > > > >> Number of Bricks: 1 x 3 = 3
> > > > > > > > > >> Transport-type: tcp
> > > > > > > > > >> Bricks:
> > > > > > > > > >> Brick1: first_node:/mnt/vg1-storage
> > > > > > > > > >> Brick2: second_node:/mnt/vg1-storage
> > > > > > > > > >> Brick3: third_node:/mnt/vg1-storage
> > > > > > > > > >> Options Reconfigured:
> > > > > > > > > >> cluster.quorum-type: auto
> > > > > > > > > >> cluster.server-quorum-type: server
> > > > > > > > > >> network.remote-dio: enable
> > > > > > > > > >> cluster.eager-lock: enable
> > > > > > > > > >> performance.readdir-ahead: on
> > > > > > > > > >> performance.quick-read: off
> > > > > > > > > >> performance.read-ahead: off
> > > > > > > > > >> performance.io-cache: off
> > > > > > > > > >> performance.stat-prefetch: off
> > > > > > > > > >> features.shard: on
> > > > > > > > > >> features.shard-block-size: 256MB
> > > > > > > > > >> cluster.server-quorum-ratio: 51%
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> Thanks for your help
> > > > > > > > > >>
> > > > > > > > > >> --
> > > > > > > > > >> Kevin Lemonnier
> > > > > > > > > >> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > > > > > > > >>
> > > > > > > > > >> _______________________________________________
> > > > > > > > > >> Gluster-users mailing list
> > > > > > > > > >> Gluster-users@xxxxxxxxxxx
> > > > > > > > > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > > > > > > > > >>

-- 
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
Attachment: glustershd.log.tgz (GNU Unix tar archive)
Attachment: signature.asc (Digital signature)