libgfapi uses the same translators as the fuse client. That means
you have the same client translator with the same behavior as any
other client. Since the client translator connects to all servers,
the loss of any one server without closing the tcp connection should
result in the same ping-timeout->continued-use as any other
client. Since this isn't happening, I would look to the client logs
and/or network captures. There are, as you know, no "primary" or
"secondary" bricks. They're all equal. Failure to continue using any
particular server suggests to me that maybe there's some problem
there. I'll see if I can put together some sort of simulation today
to test it myself, though.

On 4/16/2014 8:04 AM, Paul Penev wrote:
>>> I can easily reproduce the problem on this cluster. It appears
>>> that there is a "primary" replica and a "secondary" replica. If I
>>> reboot or kill the glusterfs process there are no problems on the
>>> running VM.
>>
>> Good. That is as expected.
>
> Sorry, I was not clear enough. I meant that if I reboot the
> "secondary" replica, there are no problems.
>
>>> If I reboot or "killall -KILL glusterfsd" the primary replica (so
>>> I don't let it terminate properly), I can block the VM each time.
>>
>> Have you followed my blog advice to prevent the vm from remounting
>> the image filesystem read-only and waited ping-timeout seconds (42
>> by default)?
>
> I have not followed your advice, but there is a difference: I get
> i/o errors *reading* from the disk. Once the problem kicks in, I
> cannot issue commands (like ls) because they can't be read. There is
> a problem with that setup: it cannot be implemented on windows
> machines (which are more vulnerable) and also cannot be implemented
> on machines which I have no control over (customers).
>
>>> If I "reset" the VM it will not find the boot disk.
>>
>> Somewhat expected if within the ping-timeout.
>
> The issue persists beyond the ping-timeout. The KVM process needs to
> be reinitialized. I guess libgfapi needs to reconnect from scratch.
>
>>> If I power down and power up the VM, then it will boot but will
>>> find corruption on disk during the boot that requires fixing.
>>
>> Expected since the vm doesn't use the image filesystem
>> synchronously. You can change that with mount options at the cost
>> of performance.
>
> Ok. I understand this point.
>
>> Unless you wait for ping-timeout and then continue writing, the
>> replica is actually still in sync. It's only out of sync if you
>> write to one replica but not the other. You can shorten the ping
>> timeout. There is a cost to reconnection if you do. Be sure to test
>> a scenario with servers under production loads and see what the
>> performance degradation during a reconnect is. Balance your needs
>> appropriately.
>
> Could you please elaborate on the cost of reconnection?
>
> I will try to run with a very short ping timeout (2sec) and see if
> the problem is in the ping-timeout or perhaps not.
>
> Paul
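
As a rough illustration of the "ping-timeout->continued-use" expectation
described at the top of this message, here is a toy Python model. It is
only a sketch of the expected behaviour, not GlusterFS or libgfapi code:
the Brick class, its alive/connected fields and the
stall_before_io_resumes() function are all hypothetical names made up
for the example. The one number taken from the thread is the 42 second
default ping-timeout.

```python
from dataclasses import dataclass
from typing import List, Optional

PING_TIMEOUT = 42.0  # seconds; the default mentioned in the thread


@dataclass
class Brick:
    """Hypothetical stand-in for one server as seen by a client."""
    name: str
    alive: bool = True       # is the server process actually running?
    connected: bool = True   # does the client still think the TCP link is up?


def stall_before_io_resumes(bricks: List[Brick],
                            ping_timeout: float = PING_TIMEOUT) -> Optional[float]:
    """Return how long client I/O blocks, or None if no brick is left.

    Toy assumption: a server that dies without closing its TCP
    connection is only noticed once ping_timeout has elapsed. Several
    dead servers time out in parallel, so the stall is not additive.
    """
    stall = 0.0
    for brick in bricks:
        if brick.connected and not brick.alive:
            stall = ping_timeout
            brick.connected = False  # the client gives up on this brick
    survivors = [b for b in bricks if b.connected and b.alive]
    return stall if survivors else None


if __name__ == "__main__":
    # Kill each server in turn; the model treats them identically.
    for dead in ("server1", "server2"):
        bricks = [Brick("server1"), Brick("server2")]
        for brick in bricks:
            if brick.name == dead:
                brick.alive = False
        print(dead, stall_before_io_resumes(bricks))
```

In this model it makes no difference which of the two servers is
killed: the stall is one ping-timeout either way, and I/O continues on
the survivor. A client that behaves differently depending on which
server went away is therefore not following the expected pattern, which
is why the client logs are the first place to look.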