Probably should have sent this to the devel list originally. Devs, any thoughts on this stale NFS file handle issue? I would love to hear any suggestions. This failure to recover seems like it should be important, and I need help troubleshooting it. Thanks very much, Chris

----- Forwarded Message -----
From: "Christopher Hawkins" <chawkins@xxxxxxxxxxx>
To: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Friday, May 7, 2010 1:25:36 PM (GMT-0500) Auto-Detected
Subject: Re: [Gluster-users] HA problems

Adding information to my post: before trying the HA translator, I said that the client does not recover and reports an endless string of stale NFS file handles. Here are the relevant parts of the log file from that scenario:

[... vol file ...]

7: ### Add client feature and attach to remote subvolume
8: volume master_root
9:   type protocol/client
10:   option transport-type tcp
11:   option remote-host 192.168.1.99  # this is the virtual IP
12:   option transport.socket.nodelay on
13:   option ping-timeout 5
14:   option remote-port 3399
15:   option remote-subvolume threads1
16: end-volume

[..... starts off ok ......]

[2010-05-07 09:13:55] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-07 09:13:55] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.99:3399, attached to remote volume 'threads1'.

[..... some lookup errors, probably related to this being a shared root cluster? .....]
[2010-05-07 09:16:59] W [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse: LOOKUP(/sbin/mkfs.vfat) inode (ptr=0x90d6538, ino=35520686, gen=5467131851820775419) found conflict (ptr=0x90d62f8, ino=35520686, gen=5467131851820775419)
[2010-05-07 09:17:00] W [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse: LOOKUP(/sbin/tune2fs) inode (ptr=0x90da030, ino=35520696, gen=5467131851820775464) found conflict (ptr=0x90d35a8, ino=35520696, gen=5467131851820775464)
[2010-05-07 09:17:20] W [fuse-bridge.c:1848:fuse_readv_cbk] glusterfs-fuse: 200101: READ => -1 (Invalid cross-device link)
[2010-05-07 09:17:20] W [fuse-bridge.c:1848:fuse_readv_cbk] glusterfs-fuse: 200104: READ => -1 (Invalid cross-device link)

[..... now I kill the server and the IP fails over to an identical box .....]

[2010-05-07 09:20:45] E [client-protocol.c:415:client_ping_timer_expired] master_root: Server 192.168.1.99:3399 has not responded in the last 5 seconds, disconnecting.

[..... briefly we have a transport endpoint error .....]

[2010-05-07 09:20:47] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209367: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-05-07 09:20:47] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209368: LOOKUP() / => -1 (Transport endpoint is not connected)

[..... then the client process reconnects to the other server, which now has the .99 IP ......]

[2010-05-07 09:20:48] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.99:3399, attached to remote volume 'threads1'.
[2010-05-07 09:20:48] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.99:3399, attached to remote volume 'threads1'.

[...... every attempt to use the mountpoint now produces a stale NFS error - in previous versions this did not happen .....]
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209369: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209370: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209371: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209372: LOOKUP() / => -1 (Stale NFS file handle)

[root@node1 ~]# ls /shared_root
ls: /shared_root: Stale NFS file handle

----- "Christopher Hawkins" <chawkins@xxxxxxxxxxx> wrote:
> Hello, I have a problem now that was previously solved. In a simple
> setup with two servers and one client, the client connected to a
> virtual IP that could fail back and forth to whichever server was
> available. This used to work, but I had not tested between 2.09 and
> today... And now, instead of recovering after a brief timeout, the
> client never recovers and reports endless "Stale NFS file handle"
> errors in its log (though there is no NFS involved, just the native
> gluster client).
>
> So I tried the HA translator from testing. This also does not work.
> After I kill the primary server (listed first in the config file),
> an ls of the mount point hangs for a moment and then reports:
>
> [root@server2 glusterfs]# ls /mnt/test
> ls: /mnt/test: Input/output error
>
> Each attempted ls also produces two errors in the client log: a
> "Transport endpoint is not connected" error followed by the
> "Input/output error".
>
> The client log shows this:
>
> [2010-05-07 12:03:44] N [glusterfsd.c:1408:main] glusterfs: Successfully started
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master2_root: Connected to 192.168.1.92:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master2_root: Connected to 192.168.1.92:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:03:44] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
>
> [.... here I killed the primary server ....]
>
> [2010-05-07 12:06:17] E [client-protocol.c:415:client_ping_timer_expired] master_root: Server 192.168.1.91:3399 has not responded in the last 42 seconds, disconnecting.
> [2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: forced unwinding frame type(1) op(LOOKUP)
> [2010-05-07 12:06:17] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:17] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 10: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: forced unwinding frame type(2) op(PING)
> [2010-05-07 12:06:17] N [client-protocol.c:6994:notify] master_root: disconnected
> [2010-05-07 12:06:17] E [socket.c:762:socket_connect_finish] master_root: connection to 192.168.1.91:3399 failed (No route to host)
> [2010-05-07 12:06:21] E [socket.c:762:socket_connect_finish] master_root: connection to 192.168.1.91:3399 failed (No route to host)
> [2010-05-07 12:06:21] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:21] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 11: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:24] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:24] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 12: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:26] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:26] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 13: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:39] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:39] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 14: LOOKUP() / => -1 (Input/output error)
>
> [.... here I powered the primary server back on ....]
>
> [2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
> --------- end log ------------
>
> After it came back, the client recovered and everything picked back
> up. But it seems I cannot get the client to consider any server other
> than the first one it connects to. I assume that if failing the
> primary server's IP address over to another box doesn't work, then
> round-robin DNS will also not work, since the two are essentially the
> same method (a different server answering at the same address). And
> since this used to work, it seems to be an unintended result.
>
> The server vol file has a single export and io-threads, and the
> client has just the two remote-subvolumes and the ha declaration,
> like so:
>
> volume ha
>   type cluster/ha
>   subvolumes master_root master2_root
> end-volume
>
> The code base is GlusterFS version 3.04, compiled from source this
> morning. How can I troubleshoot?
>
> Christopher Hawkins
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
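
For anyone wanting to reproduce the HA-translator test, here is a sketch of what the full client vol file implied by the excerpts above would look like. Only the ha stanza was quoted verbatim; the two protocol/client stanzas are reconstructed by analogy with the master_root stanza from the first log, with the 192.168.1.91/.92 addresses taken from the second log. This is an assumption-filled reconstruction, not the exact file.

```
### Reconstructed HA test client vol file (a sketch, not the original)
volume master_root
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.1.91
  option transport.socket.nodelay on
  # The HA-test log shows "has not responded in the last 42 seconds",
  # so ping-timeout was presumably left at its default (42s) here,
  # unlike the 5s used in the virtual-IP test.
  option remote-port 3399
  option remote-subvolume threads1
end-volume

volume master2_root
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.1.92
  option transport.socket.nodelay on
  option remote-port 3399
  option remote-subvolume threads1
end-volume

### This stanza is quoted verbatim from the mail above
volume ha
  type cluster/ha
  subvolumes master_root master2_root
end-volume
```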