On Thu, Oct 22, 2015 at 08:45:04PM +0200, André Bauer wrote:
> Hi,
>
> i have a 4 node Glusterfs 3.5.6 Cluster.
>
> My VM images are in an replicated distributed volume which is accessed
> from kvm/qemu via libgfapi.
>
> Mount is against storage.domain.local which has IPs for all 4 Gluster
> nodes set in DNS.
>
> When one of the Gluster nodes goes down (accidently reboot) a lot of the
> vms getting read only filesystem. Even when the node comes back up.
>
> How can i prevent this?
> I expect that the vm just uses the replicated file on the other node,
> without getting ro fs.
>
> Any hints?

There are at least two timeouts that are involved in this problem:

1. The filesystem in a VM can go read-only when the virtual disk where
   the filesystem is located does not respond for a while.

2. When a storage server that holds a replica of the virtual disk
   becomes unreachable, the Gluster client (qemu+libgfapi) waits for
   max. network.ping-timeout seconds before it resumes I/O.

Once a filesystem in a VM goes read-only, you might be able to fsck and
re-mount it read-writable again. It is not something a VM will do by
itself.

The timeouts for (1) are set in sysfs:

    $ cat /sys/block/sda/device/timeout
    30

30 seconds is the default for SD-devices, and for testing you can change
it with an echo:

    # echo 300 > /sys/block/sda/device/timeout

This is not a persistent change; you can create a udev-rule to apply
this change at bootup.

Some filesystems offer a mount option that can change the behaviour
after a disk error is detected. "man mount" shows the "errors" option
for ext*. Changing this to "continue" is not recommended; "remount-ro"
or "panic" will be the safest for your data.

The timeout mentioned in (2) is for the Gluster Volume, and checked by
the client. When a client does a write to a replicated volume, the write
needs to be acknowledged by both/all replicas.
The client (libgfapi) delays the reply to the application (qemu) until
both/all replies from the replicas have been received. When a replica is
unreachable, this delay is bounded by the volume option
network.ping-timeout (42 seconds by default).

Now, if the VM returns block errors after 30 seconds, and the client
waits up to 42 seconds for recovery, there is an issue...

So, your solution could be to increase the timeout for error detection
of the disks inside the VMs, and/or decrease the network.ping-timeout.

It would be interesting to know if adapting these values prevents the
read-only occurrences in your environment. If you do any testing with
this, please keep me informed about the results.

Niels
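
To make the comparison between the two timeouts concrete, here is a
minimal sketch of a check you could run while testing. The helper name
check_timeouts is hypothetical, and treating equal values as "too short"
is an assumption on my side, not something Gluster prescribes. It only
compares two numbers given in seconds:

```shell
#!/bin/sh
# Hypothetical helper: compare the guest disk timeout with the Gluster
# network.ping-timeout, both given in seconds.
# Assumption: the disk timeout should be strictly larger than the
# ping-timeout, so equal values are reported as "too short".
check_timeouts() {
    disk_timeout=$1
    ping_timeout=$2
    if [ "$disk_timeout" -gt "$ping_timeout" ]; then
        echo "ok"
    else
        echo "too short"
    fi
}

check_timeouts 30 42     # the defaults mentioned in this mail
check_timeouts 300 42    # after raising the disk timeout to 300
```

Inside a VM you could pass the value read from
/sys/block/sda/device/timeout as the first argument. On the storage
side, the ping-timeout should be adjustable with something like
"gluster volume set VOLNAME network.ping-timeout 30", where VOLNAME is a
placeholder for your volume name and 30 is only an example value.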
_______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel