Joe, this will be greatly appreciated. All I see in the client logs is:

[2014-04-15 15:11:08.213748] W [socket.c:514:__socket_rwv] 0-pool-client-2: readv failed (No data available)
[2014-04-15 15:11:08.214165] W [socket.c:514:__socket_rwv] 0-pool-client-3: readv failed (No data available)
[2014-04-15 15:11:08.214596] W [socket.c:514:__socket_rwv] 0-pool-client-0: readv failed (No data available)
[2014-04-15 15:11:08.214941] W [socket.c:514:__socket_rwv] 0-pool-client-1: readv failed (No data available)
[2014-04-15 15:35:24.165391] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-15 15:35:24.165437] W [socket.c:1962:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (No data available), peer (127.0.0.1:24007)
[2014-04-15 15:35:34.419719] E [socket.c:2157:socket_connect_finish] 0-glusterfs: connection to 127.0.0.1:24007 failed (Connection refused)
[2014-04-15 15:35:34.419757] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-15 15:35:37.420492] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-15 15:35:39.330948] W [glusterfsd.c:1002:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9a705460ed] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f9a70bf2b50] (-
[2014-04-15 15:37:52.849982] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.3 (/usr/sbin/glusterfs --volfile-id=pool --volfile-server=localhost /mnt/pve
[2014-04-15 15:37:52.879574] I [socket.c:3480:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-04-15 15:37:52.879617] I [socket.c:3495:socket_init] 0-glusterfs: using system polling thread

Sometimes I see a lot of it:

[2014-04-16 13:29:29.521516] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:32.522267] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:35.523006] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:38.523773] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:41.524456] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:44.525324] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:47.526080] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:50.526819] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:53.527617] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:56.528228] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:29:59.529023] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
[2014-04-16 13:30:02.529772] W [socket.c:514:__socket_rwv] 0-glusterfs: readv failed (No data available)
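Since you suggest network captures below, this is roughly what I plan to run on the client while reproducing the hang. A sketch only: eth0 and the 49152-49200 brick port range are assumptions about my setup (gluster volume status pool should show the real brick ports); 24007 is the glusterd port that shows up in the log above.

    # capture gluster traffic between this client and the servers while reproducing the hang
    # (interface name and brick port range are guesses; check "gluster volume status pool" for the actual ports)
    tcpdump -i eth0 -s 0 -w /tmp/gluster-client.pcap 'port 24007 or portrange 49152-49200'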
2014-04-16 18:20 GMT+02:00 Joe Julian <joe@xxxxxxxxxxxxxxxx>:
> libgfapi uses the same translators as the fuse client. That means you have the same client translator, with the same behavior, as any other client. Since the client translator connects to all servers, the loss of any one server without closing the TCP connection should result in the same ping-timeout-then-continued-use behavior as for any other client. Since this isn't happening, I would look to the client logs and/or network captures. There are, as you know, no "primary" or "secondary" bricks. They're all equal. Failure to continue using any particular server suggests to me that maybe there's some problem there.
>
> I'll see if I can put together some sort of simulation today to test it myself though.
>
> On 4/16/2014 8:04 AM, Paul Penev wrote:
>> I can easily reproduce the problem on this cluster. It appears that there is a "primary" replica and a "secondary" replica.
>>
>>>> If I reboot or kill the glusterfs process there are no problems on the running VM.
>>>
>>> Good. That is as expected.
>>
>> Sorry, I was not clear enough. I meant that if I reboot the "secondary" replica, there are no problems.
>>
>>>> If I reboot or "killall -KILL glusterfsd" the primary replica (so I don't let it terminate properly), I can block the VM each time.
>>>
>>> Have you followed my blog advice to prevent the VM from remounting the image filesystem read-only, and waited ping-timeout seconds (42 by default)?
>>
>> I have not followed your advice, but there is a difference: I get I/O errors *reading* from the disk. Once the problem kicks in, I cannot issue commands (like ls) because they can't be read.
>>
>> There is a problem with that setup: it cannot be implemented on Windows machines (which are more vulnerable), and it also cannot be implemented on machines over which I have no control (customers).
>>
>>>> If I "reset" the VM it will not find the boot disk.
>>>
>>> Somewhat expected if within the ping-timeout.
>>
>> The issue persists beyond the ping-timeout. The KVM process needs to be reinitialized. I guess libgfapi needs to reconnect from scratch.
>>
>>>> If I power down and power up the VM, then it will boot but will find corruption on disk during the boot that requires fixing.
>>>
>>> Expected, since the VM doesn't use the image filesystem synchronously. You can change that with mount options at the cost of performance.
>>
>> Ok. I understand this point.
>>
>>> Unless you wait for ping-timeout and then continue writing, the replica is actually still in sync. It's only out of sync if you write to one replica but not the other.
>>>
>>> You can shorten the ping timeout. There is a cost to reconnection if you do. Be sure to test a scenario with servers under production load and see what the performance degradation during a reconnect is. Balance your needs appropriately.
>>
>> Could you please elaborate on the cost of reconnection? I will try to run with a very short ping timeout (2 sec) and see whether or not the problem is in the ping-timeout.
>>
>> Paul
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users@xxxxxxxxxxx
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
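For the short ping-timeout test I mention above, I expect the change to look roughly like this. A sketch, assuming the volume is really named "pool" as the client log prefixes suggest; network.ping-timeout is the volume option behind the 42-second default Joe refers to.

    # lower the ping timeout (in seconds); 42 is the default, 2 is the aggressive value I want to test
    gluster volume set pool network.ping-timeout 2
    # the new value should show up under "Options Reconfigured"
    gluster volume info pool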