Ok, here is one more hint that points in the direction of libgfapi not re-establishing the connections to the bricks after they come back online: if I live-migrate the KVM machine from one node to another after the bricks are back online, and then kill the second brick, the KVM will not suffer from disk problems. Obviously, during migration the new process on the new node is forced to reconnect to the gluster volume, and so re-establishes both links. After that it can lose either of the links without problems.

Steps to replicate:

1. Start a KVM VM and boot it from a replicated volume.
2. killall -KILL glusterfsd on one brick (brick1). Verify that the KVM is still working.
3. Bring back the glusterfsd on brick1.
4. Heal the volume (gluster vol heal <vol>) and wait until gluster vol heal <vol> info shows no self-heal backlog.
5. Now live-migrate the KVM from one node to another.
6. killall -KILL glusterfsd on the second brick (brick2).
7. Verify that the KVM is still working (!). It would have died from disk errors if step 5 had not been executed.
8. Bring back glusterfsd on brick2, heal and enjoy.
9. Repeat at will: the KVM will never die again, provided you migrate it once before a brick failure.

What this means to me: there's a problem in libgfapi, gluster 3.4.2 and 3.4.3 (at least) and/or kvm 1.7.1 (I'm running the latest 1.7 source tree in production).

Joe: we're in your hands. I hope you find the problem somewhere.

Paul.
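PS: for the record, this is roughly what a gfapi client does when it attaches to a volume. It is only a minimal sketch, not the actual qemu code, and the volume name "vmstore", host "gluster1" and image name are made up. The point is that the brick connections are set up once, inside glfs_init(), and the resulting context lives for the whole process, which would explain why a live migration (new qemu process, new context) gets both links back:

/* Minimal libgfapi client sketch (hypothetical names throughout). */
#include <stdio.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
        glfs_t *fs = glfs_new("vmstore");        /* hypothetical volume name */
        if (!fs)
                return 1;

        glfs_set_volfile_server(fs, "tcp", "gluster1", 24007);
        glfs_set_logging(fs, "/tmp/gfapi.log", 7);

        /* Connections to glusterd and to the bricks are established here
         * and then reused for the lifetime of this context/process. */
        if (glfs_init(fs) != 0) {
                glfs_fini(fs);
                return 1;
        }

        glfs_fd_t *fd = glfs_open(fs, "vm-disk.qcow2", O_RDONLY);
        if (fd) {
                char buf[512];
                glfs_read(fd, buf, sizeof(buf), 0);
                glfs_close(fd);
        }

        glfs_fini(fs);
        return 0;
}

(If I remember right it builds against the glusterfs-api pkg-config package on 3.4.)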