Hi.

I've been trying to find out what's going on for several days now, but can't find anything myself, so I'm asking the GlusterFS experts for some help ;-)

I'm running 3 replicated Gluster volumes between 2 nodes (each node hosts 3 bricks, one per volume). Components involved:

- CentOS 7.0 x86_64 / kernel 3.10.0-123.20.1
- GlusterFS 3.5.3 (yes, I should upgrade, I know)

This is used to host qemu-kvm VMs (1 GlusterFS volume for VM images, 1 for libvirt locks, 1 for VM states, e.g. a VM saved with virsh save vm1 can be restored on the other node). The VMs are hosted on the GlusterFS servers themselves (each node fuse-mounts the storage volume on /var/lib/libvirt/images), so the nodes are both GlusterFS server and client. The VMs run only on the first node (but can be live-migrated to the second one in case of a problem).

The 3 volumes (vmstore, save and locks) have the same configuration:

    [root@master1 ~]# gluster vol info vmstore

    Volume Name: vmstore
    Type: Replicate
    Volume ID: 7ed967f1-3b33-46d7-8908-0bb78c6e9199
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: master1:/mnt/bricks/vmstore
    Brick2: master2:/mnt/bricks/vmstore
    Options Reconfigured:
    diagnostics.client-log-level: DEBUG
    diagnostics.brick-log-level: INFO
    cluster.eager-lock: on
    network.frame-timeout: 300
    network.ping-timeout: 20
    nfs.disable: on

This setup worked well for more than a year, but had a big failure 3 months ago: all my VMs had a kernel panic because they couldn't access their storage anymore. Looking at my logs, I saw that the Gluster fuse client had lost its connection to both bricks because they had not responded for more than 5 sec (which was the network.ping-timeout at that time). I don't really understand how this could happen, as the network was OK, and anyway one of the bricks is running on 127.0.0.1, so for that one it is definitely not a network issue.

I've increased network.ping-timeout to 20 sec, which allowed all my VMs to be started again without the connection to the bricks being lost.
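To double-check that the three volumes really carry the same reconfigured options, the gluster vol info output shown above could be parsed and compared. This is only a minimal sketch, assuming Python 3.7+ on the node and the "key: value" layout shown in this mail; the function names (parse_vol_info, option_differences, fetch_options) are made up for illustration and are not part of any Gluster tooling.

```python
import subprocess


def parse_vol_info(text):
    """Split `gluster vol info <vol>` output into (general fields, reconfigured options).

    Assumes the "key: value" layout shown in this mail; other versions may differ.
    """
    info, options = {}, {}
    in_options = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "Options Reconfigured:":
            in_options = True
            continue
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # section header without a value, e.g. "Bricks:"
        (options if in_options else info)[key] = value.strip()
    return info, options


def option_differences(per_volume):
    """Return {option: {volume: value}} for every option that differs between volumes."""
    names = set()
    for opts in per_volume.values():
        names.update(opts)
    diffs = {}
    for name in sorted(names):
        values = {vol: opts.get(name) for vol, opts in per_volume.items()}
        if len(set(values.values())) > 1:
            diffs[name] = values
    return diffs


def fetch_options(volume):
    """Run `gluster vol info <volume>` (needs root) and return its reconfigured options."""
    out = subprocess.run(
        ["gluster", "vol", "info", volume],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vol_info(out)[1]
```

Calling option_differences({v: fetch_options(v) for v in ("vmstore", "save", "locks")}) on one of the nodes should return an empty dict if the three volumes are configured identically.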
Now things are working, but since that day I have had random IO hangs from time to time. When the problem occurs, all IO in all the VMs hangs, and the load on the hypervisor (which is also the GlusterFS client and one of the bricks) goes crazy (I've seen up to ~120). The load goes so high that I can't do anything on the hypervisor; I lose my SSH access, which stops responding. The problem lasts for 5 or 10 minutes, then everything starts working again (some VMs don't like being stuck for that long and need to be restarted).

The problem is very random: it can happen every 2 days, or everything can work without a single issue for more than 3 weeks. It doesn't depend on the load, nor on the access pattern.

I suspect something in Gluster to be the culprit, but I can't find anything. I've enabled DEBUG logging on the client (but not on the bricks, as it is just too verbose), and will see if I can get more info the next time the issue happens.

At first I noticed that the problem always happened when I ran a monitoring script (which runs several gluster commands and parses their output to check the status of the different volumes; the script is available here [1] if anyone is interested), but I've now completely disabled monitoring, and I still have this random issue.

A strange thing I've noticed is that the main volume (the one storing the VM images) continuously shows files being healed if I look at:

    gluster vol heal vmstore info healed

Every 10 (exactly 10) minutes I see a few VM images being healed, but nothing in the client logs or in the system load indicates that a heal is taking place.

I'm lost and don't know where to look; I'd really appreciate some help :-)
(We're ready to hire a GlusterFS expert to help us sort this out if necessary; this is a critical installation for us.)

[1]: https://gitweb.firewall-services.com/?p=zabbix-agent-addons;a=blob_plain;f=zabbix_scripts/check_gluster_sudo;hb=HEAD
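One way to line up the 10-minute heal entries with the load spikes would be to record both with timestamps in the same file. The following is a rough sketch only, assuming Python 3.7+ on the hypervisor; the helper names (run_heal_info, take_sample, format_sample, log_forever) are hypothetical. Since the hangs at first seemed to coincide with gluster commands being run, the polling itself could influence the problem, so the results would have to be read with that in mind.

```python
import os
import subprocess
import time

# The heal command quoted in this mail, split into arguments.
HEAL_CMD = ["gluster", "vol", "heal", "vmstore", "info", "healed"]


def run_heal_info():
    """Run the heal command and return its raw output, or a note if it failed."""
    try:
        result = subprocess.run(HEAL_CMD, capture_output=True, text=True, timeout=60)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return "heal command failed: %r" % (exc,)
    return result.stdout


def take_sample(run=run_heal_info, now=time.time, loadavg=os.getloadavg):
    """Collect one sample: timestamp, 1-minute load average and raw heal output."""
    return {"time": now(), "load1": loadavg()[0], "heal_output": run()}


def format_sample(sample):
    """Render a sample as a timestamped text record."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(sample["time"]))
    return "%s load1=%.2f\n%s\n" % (stamp, sample["load1"], sample["heal_output"].rstrip())


def log_forever(path, interval=60):
    """Append one record to `path` every `interval` seconds until interrupted."""
    while True:
        with open(path, "a") as fh:
            fh.write(format_sample(take_sample()))
        time.sleep(interval)
```

The log file should live on local disk rather than on a Gluster mount, otherwise the logger would hang together with the VMs.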
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-users