Re: libgfapi failover problem on replica bricks

Joe Julian <joe@xxxxxxxxxxxxxxxx> · Wed, 16 Apr 2014 09:20:52 -0700

    libgfapi uses the same translators as the fuse client. That means
    you have the same client translator with the same behavior as any
    other client. Since the client translator connects to all servers,
    losing any one server without the TCP connection being closed should
    produce the same sequence as on any other client: a stall for
    ping-timeout, then continued use of the remaining servers. Since
    this isn't happening, I would look at the client logs and/or network
    captures. As you know, there are no "primary" or "secondary" bricks.
    They're all equal. Failure to continue using any particular server
    suggests to me that maybe there's some problem there.

    I'll see if I can put together some sort of simulation today to test
    it myself though.
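
A toy model of the behavior described above, as a minimal sketch and not anything taken from the GlusterFS source: every brick is treated identically, a brick that goes silent without closing its TCP connection stalls I/O until ping-timeout expires, and after that I/O continues on whatever bricks remain. The names `Brick`, `ReplicaClient` and `io_state` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

PING_TIMEOUT = 42.0  # seconds; the default quoted in this thread


@dataclass
class Brick:
    name: str
    # Time at which the brick went silent without closing its TCP
    # connection; None means it is answering normally.
    silent_since: Optional[float] = None


@dataclass
class ReplicaClient:
    bricks: List[Brick]
    ping_timeout: float = PING_TIMEOUT

    def io_state(self, now: float) -> str:
        """Return "stalled", "ok" or "failed" for an I/O issued at `now`."""
        healthy = [b for b in self.bricks if b.silent_since is None]
        # A silent brick is still considered connected until ping-timeout
        # has elapsed, so requests sent to it just hang.
        pending = [
            b for b in self.bricks
            if b.silent_since is not None
            and now - b.silent_since < self.ping_timeout
        ]
        if pending:
            return "stalled"
        # Past ping-timeout the silent bricks are dropped; any remaining
        # brick can serve I/O, whichever one it is.
        return "ok" if healthy else "failed"
```

With two bricks and one of them going silent at t=0, `io_state(10.0)` is "stalled" and `io_state(42.0)` is "ok", regardless of which of the two bricks was lost.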

    On 4/16/2014 8:04 AM, Paul Penev wrote:

          I can easily reproduce the problem on this cluster. It appears that
          there is a "primary" replica and a "secondary" replica.

          If I reboot or kill the glusterfs process there are no problems on the
          running VM.

        Good. That is as expected.

      Sorry, I was not clear enough. I meant that if I reboot the
      "secondary" replica, there are no problems.

          If I reboot or "killall -KILL glusterfsd" the primary replica (so I
          don't let it terminate properly), I can block the VM each time.

        Have you followed my blog advice to prevent the VM from remounting the image filesystem read-only, and waited ping-timeout seconds (42 by default)?

      I have not followed your advice, but there is a difference: I get I/O
      errors *reading* from the disk. Once the problem kicks in, I cannot issue
      commands (like ls) because they can't be read from disk.

      There is a problem with that setup: it cannot be implemented on
      Windows machines (which are more vulnerable) and also cannot be
      implemented on machines which I have no control over (customers).

          If I "reset" the VM it will not find the boot disk.

        Somewhat expected if within the ping-timeout.

      The issue persists beyond the ping-timeout. The KVM process needs to
      be reinitialized. I guess libgfapi needs to reconnect from scratch.

          If I power down and power up the VM, then it will boot but will find
          corruption on disk during the boot that requires fixing.

        Expected since the vm doesn't use the image filesystem synchronously. You can change that with mount options at the cost of performance.

      Ok. I understand this point.

        Unless you wait for ping-timeout and then continue writing, the replica is actually still in sync. It's only out of sync if you write to one replica but not the other.

        You can shorten the ping timeout. There is a cost to reconnection if you do. Be sure to test a scenario with servers under production loads and see what the performance degradation during a reconnect is. Balance your needs appropriately.
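
For reference, the ping timeout is a per-volume option named `network.ping-timeout`, set through `gluster volume set`. The helper below is only a sketch that builds the command line so it can be reviewed before being run; the function name and the volume name `vmstore` are placeholders, not something from this thread.

```python
import shlex
from typing import List


def ping_timeout_command(volume: str, seconds: int) -> List[str]:
    """Build the argv for setting network.ping-timeout on one volume."""
    if not volume:
        raise ValueError("volume name must not be empty")
    if seconds <= 0:
        raise ValueError("ping timeout must be a positive number of seconds")
    return ["gluster", "volume", "set", volume,
            "network.ping-timeout", str(seconds)]


if __name__ == "__main__":
    # "vmstore" is a hypothetical volume name.
    print(shlex.join(ping_timeout_command("vmstore", 10)))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without a shell. Whatever value is chosen should still be tested under production-like load, as advised above.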

      Could you please elaborate on the cost of reconnection? I will try to
      run with a very short ping timeout (2 seconds) and see whether the
      problem is in the ping-timeout or elsewhere.

Paul
