So the VMs were configured with cache set to none; I just tried cache=directsync and it seems to fix the issue. I still need to run more tests, but I did a couple already with that option and got no I/O errors. I never had to do this before, is it a known requirement? I found the clue in an old mail from this mailing list, did I miss some documentation saying you should be using directsync with GlusterFS?
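For reference, here is roughly how I'm switching the cache mode; the VM ID, disk slot and storage name are just placeholders for my setup, adjust them to yours:

    # Proxmox: change the cache mode of an existing disk
    # (101, virtio0 and the storage name "gluster" are placeholders)
    qm set 101 --virtio0 gluster:101/vm-101-disk-1.qcow2,cache=directsync

    # If I read it right, this ends up on the qemu command line as something like:
    #   -drive file=gluster://localhost/gluster/images/101/vm-101-disk-1.qcow2,format=qcow2,cache=directsync

As far as I understand it, directsync opens the image with O_DIRECT like none does, but it also doesn't report writes as completed until they have been synced, so maybe that is what makes the difference here; I'd appreciate confirmation from someone who knows the qemu side better.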

On Tue, May 24, 2016 at 11:33:28AM +0200, Kevin Lemonnier wrote:
> Hi,
>
> Some news on this.
> I actually don't need to trigger a heal to get corruption, so the problem
> is not the healing. Live migrating the VM seems to trigger corruption every
> time, and even without that just doing a database import, rebooting, then
> doing another import seems to corrupt as well.
>
> To check, I created local storages on each node on the same partition as the
> gluster bricks, on XFS, moved the VM disk onto each local storage and tested
> the same procedure one by one: no corruption. It seems to happen only on
> GlusterFS, so I'm not so sure it's hardware anymore: if it were hardware,
> local storage would corrupt too, right?
> Could I be missing some critical configuration for VM storage on my gluster volume?
>
>
> On Mon, May 23, 2016 at 01:54:30PM +0200, Kevin Lemonnier wrote:
> > Hi,
> >
> > I didn't specify it, but I use "localhost" to add the storage in Proxmox.
> > My thinking is that every Proxmox node is also a GlusterFS node, so that
> > should work fine.
> >
> > I don't want to use the "normal" way of setting a regular address in there,
> > because you can't change it afterwards in Proxmox, but could that be the
> > source of the problem? Maybe during live migration there are writes coming
> > from two different servers at the same time?
> >
> >
> > On Wed, May 18, 2016 at 07:11:08PM +0530, Krutika Dhananjay wrote:
> > > Hi,
> > >
> > > I will try to recreate this issue tomorrow on my machines with the steps
> > > that Lindsay provided in this thread. I will let you know the result soon
> > > after that.
> > >
> > > -Krutika
> > >
> > > On Wednesday, May 18, 2016, Kevin Lemonnier <lemonnierk@xxxxxxxxx> wrote:
> > > > Hi,
> > > >
> > > > Some news on this.
> > > > Over the weekend the RAID card of the node ipvr2 died, and I thought
> > > > that maybe that was the problem all along. The RAID card was changed
> > > > and yesterday I reinstalled everything.
> > > > Same problem just now.
> > > >
> > > > My test is simple: using the website hosted on the VMs the whole time,
> > > > I reboot ipvr50, wait for the heal to complete, migrate all the VMs off
> > > > ipvr2 then reboot it, wait for the heal to complete, then migrate all
> > > > the VMs off ipvr3 then reboot it.
> > > > Every time, the first database VM (which is the only one really using
> > > > the disk during the heal) starts showing I/O errors on its disk.
> > > >
> > > > Am I really the only one with that problem?
> > > > Maybe one of the drives is dying too, who knows, but SMART isn't
> > > > saying anything..
> > > >
> > > >
> > > > On Thu, May 12, 2016 at 04:03:02PM +0200, Kevin Lemonnier wrote:
> > > > > Hi,
> > > > >
> > > > > I had a problem some time ago with 3.7.6 and freezing during heals,
> > > > > and multiple people advised to use 3.7.11 instead. Indeed, with that
> > > > > version the freeze problem is fixed, it works like a dream! You can
> > > > > almost not tell that a node is down or healing, everything keeps
> > > > > working except for a little freeze when the node has just gone down
> > > > > and I assume hasn't timed out yet, but that's fine.
> > > > >
> > > > > Now I have a 3.7.11 volume on 3 nodes for testing, and the VMs are
> > > > > Proxmox VMs with qcow2 disks stored on the gluster volume.
> > > > > Here is the config:
> > > > >
> > > > > Volume Name: gluster
> > > > > Type: Replicate
> > > > > Volume ID: e4f01509-beaf-447d-821f-957cc5c20c0a
> > > > > Status: Started
> > > > > Number of Bricks: 1 x 3 = 3
> > > > > Transport-type: tcp
> > > > > Bricks:
> > > > > Brick1: ipvr2.client:/mnt/storage/gluster
> > > > > Brick2: ipvr3.client:/mnt/storage/gluster
> > > > > Brick3: ipvr50.client:/mnt/storage/gluster
> > > > > Options Reconfigured:
> > > > > cluster.quorum-type: auto
> > > > > cluster.server-quorum-type: server
> > > > > network.remote-dio: enable
> > > > > cluster.eager-lock: enable
> > > > > performance.quick-read: off
> > > > > performance.read-ahead: off
> > > > > performance.io-cache: off
> > > > > performance.stat-prefetch: off
> > > > > features.shard: on
> > > > > features.shard-block-size: 64MB
> > > > > cluster.data-self-heal-algorithm: full
> > > > > performance.readdir-ahead: on
> > > > >
> > > > >
> > > > > As mentioned, I rebooted one of the nodes to test the freezing issue
> > > > > I had on previous versions, and apart from the initial timeout,
> > > > > nothing: the website hosted on the VMs keeps working like a charm,
> > > > > even during heal.
> > > > > Since it's testing there isn't any load on it though, and I just
> > > > > tried to refresh the database by importing the production one on the
> > > > > two MySQL VMs, and both of them started showing I/O errors. I tried
> > > > > shutting them down and powering them on again, but same thing; even
> > > > > starting full heals by hand doesn't solve the problem, the disks are
> > > > > corrupted. They still work, but sometimes they remount their
> > > > > partitions read only..
> > > > >
> > > > > I believe there are a few people already using 3.7.11, has no one
> > > > > noticed corruption problems? Anyone using Proxmox? As already
> > > > > mentioned in multiple other threads on this mailing list by other
> > > > > users, I also pretty much always have shards in heal info, but
> > > > > nothing "stuck" there, they always go away in a few seconds, getting
> > > > > replaced by other shards.
> > > > >
> > > > > Thanks
> > > > >
> > > > > --
> > > > > Kevin Lemonnier
> > > > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> > > >
> > > > --
> > > > Kevin Lemonnier
> > > > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
> >
> > --
> > Kevin Lemonnier
> > PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
>
> --
> Kevin Lemonnier
> PGP Fingerprint : 89A5 2283 04A0 E6E9 0111

--
Kevin Lemonnier
PGP Fingerprint : 89A5 2283 04A0 E6E9 0111
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users