Re: Rebooting gluster nodes makes VMs pause due to storage error

On 10/27/2015 10:54 PM, Nicolás wrote:
Hi,

We're using oVirt 3.5.3.1 with GlusterFS as the storage backend. We added a storage domain with the path "gluster.fqdn1:/volume" and the mount option "backup-volfile-servers=gluster.fqdn2". We now need to restart both gluster.fqdn1 and gluster.fqdn2 for a system update (not at the same time, obviously). We're worried because in previous attempts, when we restarted the main gluster node (gluster.fqdn1 in this case), all the VMs running against that storage backend were paused due to storage errors; we couldn't resume them and finally had to power them off the hard way and start them again.

Gluster version on gluster.fqdn1 and gluster.fqdn2 is 3.6.3-1 (on CentOS 7).

Gluster configuration for that volume is:

Volume Name: volume
Type: Replicate
Volume ID: a2d7e52c-2f63-4e72-9635-4e311baae6ff
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster.fqdn1:/gluster/brick_01/brick
Brick2: gluster.fqdn2:/gluster/brick_01/brick
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: none
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off


The supported configuration for a gluster storage domain in oVirt is replica 3 (echoing what Nir mentioned on the ovirt-users list).

With "cluster.server-quorum-type: server" and a replica 2 setup, bringing down one of the nodes will cause bricks on the remaining server to be shut down too, and will cause VMs to pause. We strongly advise you to use a replica 3 configuration or an arbiter volume (http://gluster.readthedocs.org/en/release-3.7.0/Features/afr-arbiter-volumes/)


If adding an additional server is not an option, you could try the following for the present scenario (example commands follow the list):
1. Turn off server quorum.
2. Put the host into maintenance in oVirt and bring down the gluster processes on that host.
3. Perform the maintenance activity.
4. Trigger self-heal and wait for it to complete.
5. Put the second host into maintenance in oVirt and repeat the process.
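
A rough sketch of the gluster side of those steps (the volume name "volume" is taken from your output; treat this as an outline rather than an exact recipe):

# 1. Disable server-side quorum for the volume
gluster volume set volume cluster.server-quorum-type none
# 2. After putting the host into maintenance in oVirt, stop gluster on it
systemctl stop glusterd   # on CentOS 7; brick (glusterfsd) processes may keep running and can be stopped with pkill glusterfsd
# 4. Once the host is back up, trigger a full self-heal and watch progress
gluster volume heal volume full
gluster volume heal volume info
# When both hosts are done, server quorum can be turned back on
gluster volume set volume cluster.server-quorum-type server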



We would like to know if this configuration should work, or if there's something missing or some problem with the version specified above, as pausing the VMs keeps them from failing outright but is not acceptable for us. We've also noted that the self-healing process takes *a lot* of time: the volume above is 6T, and it can take hours to synchronize after half an hour of desynchronization.

The sharding feature available in gluster 3.7 will help with heal times. Promising results have been reported by other users - minutes as opposed to hours.
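
If you do move to 3.7, sharding is enabled per volume; a minimal sketch (note that sharding only applies to files created after it is enabled, so existing VM images would need to be copied off and back, or created anew, to benefit; the 512MB block size below is just a commonly suggested value for VM images, not a requirement):

gluster volume set volume features.shard on
gluster volume set volume features.shard-block-size 512MB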


Any hints are appreciated,

Thanks.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
