"Unfortunately, when I restart every node in the cluster
sequentially...qemu image of the HA VM gets corrupted..."
Even client nodes?
Make sure that your client can connect to all of the servers.
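You can verify that from any server with something like this (untested here, and assuming your volume is named pve-vol as in the heal output below):

    # list the clients currently connected to each brick
    gluster volume status pve-vol clients

Every client should appear against every brick; if a client is missing from one brick, its writes are only reaching the remaining replicas.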
Make sure, after you restart a server, that the self-heal finishes
before you restart the next one. What I suspect is happening is this: you
restart server A, so writes happen only on server B. You then restart
server B before the heal has copied those changes from server B back to
server A, so the client now writes only to server A. When server B
comes back, both server A and server B think they have changes for the
other. This is a classic split-brain state.
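Between restarts, something like this should tell you when it's safe to continue (again assuming the pve-vol volume name; the first is the same command you ran below):

    # after the rebooted server is back, wait until every brick
    # reports "Number of entries: 0" before restarting the next one
    gluster volume heal pve-vol info

    # anything listed here is already in split-brain and will not
    # heal on its own
    gluster volume heal pve-vol info split-brain

Only move on to the next server once the first command shows zero entries on all bricks.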
On 06/04/2015 07:08 AM, Roger Lehmann wrote:
Hello, I'm having a serious problem with my GlusterFS cluster.
I'm using Proxmox 3.4 for highly available VM management, which uses
GlusterFS as its storage backend.
Unfortunately, when I restart every node in the cluster one by one
(after online migration of the running HA VM first, of course), the
qemu image of the HA VM gets corrupted and the VM itself has
problems accessing it.
May 15 10:35:09 blog kernel: [339003.942602] end_request: I/O error, dev vda, sector 2048
May 15 10:35:09 blog kernel: [339003.942829] Buffer I/O error on device vda1, logical block 0
May 15 10:35:09 blog kernel: [339003.942929] lost page write due to I/O error on vda1
May 15 10:35:09 blog kernel: [339003.942952] end_request: I/O error, dev vda, sector 2072
May 15 10:35:09 blog kernel: [339003.943049] Buffer I/O error on device vda1, logical block 3
May 15 10:35:09 blog kernel: [339003.943146] lost page write due to I/O error on vda1
May 15 10:35:09 blog kernel: [339003.943153] end_request: I/O error, dev vda, sector 4196712
May 15 10:35:09 blog kernel: [339003.943251] Buffer I/O error on device vda1, logical block 524333
May 15 10:35:09 blog kernel: [339003.943350] lost page write due to I/O error on vda1
May 15 10:35:09 blog kernel: [339003.943363] end_request: I/O error, dev vda, sector 4197184
Once the image is broken, it's impossible to migrate the VM or to
start it again after it has been stopped.
root@pve2 ~ # gluster volume heal pve-vol info
Gathering list of entries to be healed on volume pve-vol has been
successful
Brick pve1:/var/lib/glusterd/brick
Number of entries: 1
/images//200/vm-200-disk-1.qcow2
Brick pve2:/var/lib/glusterd/brick
Number of entries: 1
/images/200/vm-200-disk-1.qcow2
Brick pve3:/var/lib/glusterd/brick
Number of entries: 1
/images//200/vm-200-disk-1.qcow2
I couldn't reproduce this in my test environment with GlusterFS
3.6.2, but I had other problems while testing (possibly because the
test environment itself is virtualized), so I don't want to upgrade to
3.6.2 until I know for certain that the problems I encountered are fixed there.
Has anybody else experienced this problem? I'm not sure whether bug
1161885 (possible file corruption on dispersed volumes) is the issue I'm
experiencing, since I have a 3-node replicated cluster rather than a dispersed volume.
Thanks for your help!
Regards,
Roger Lehmann
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users