Probably should have sent this to the devel list originally. Devs, any thoughts on this stale NFS file handle issue? I would love to hear any suggestions. This failure to recover seems like it should be important, and I need help troubleshooting it. Thanks very much, Chris

----- Forwarded Message -----
From: "Christopher Hawkins" <chawkins@xxxxxxxxxxx>
To: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Friday, May 7, 2010 1:25:36 PM (GMT-0500) Auto-Detected
Subject: Re: [Gluster-users] HA problems

Adding information to my post: before trying the HA translator, I said that the client does not recover and reports an endless string of stale NFS file handles. Here are the relevant parts of the log file from that scenario:

[... vol file ...]

7: ### Add client feature and attach to remote subvolume
8: volume master_root
9:   type protocol/client
10:   option transport-type tcp
11:   option remote-host 192.168.1.99  # this is the virtual IP
12:   option transport.socket.nodelay on
13:   option ping-timeout 5
14:   option remote-port 3399
15:   option remote-subvolume threads1
16: end-volume

[..... starts off ok ......]

[2010-05-07 09:13:55] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-07 09:13:55] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.99:3399, attached to remote volume 'threads1'.

[..... some lookup errors, probably related to this being a shared root cluster? .....]
[2010-05-07 09:16:59] W [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse: LOOKUP(/sbin/mkfs.vfat) inode (ptr=0x90d6538, ino=35520686, gen=5467131851820775419) found conflict (ptr=0x90d62f8, ino=35520686, gen=5467131851820775419)
[2010-05-07 09:17:00] W [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse: LOOKUP(/sbin/tune2fs) inode (ptr=0x90da030, ino=35520696, gen=5467131851820775464) found conflict (ptr=0x90d35a8, ino=35520696, gen=5467131851820775464)
[2010-05-07 09:17:20] W [fuse-bridge.c:1848:fuse_readv_cbk] glusterfs-fuse: 200101: READ => -1 (Invalid cross-device link)
[2010-05-07 09:17:20] W [fuse-bridge.c:1848:fuse_readv_cbk] glusterfs-fuse: 200104: READ => -1 (Invalid cross-device link)

[..... now I kill the server and the IP fails over to an identical box .....]

[2010-05-07 09:20:45] E [client-protocol.c:415:client_ping_timer_expired] master_root: Server 192.168.1.99:3399 has not responded in the last 5 seconds, disconnecting.

[..... briefly we have a transport endpoint error .....]

[2010-05-07 09:20:47] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209367: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-05-07 09:20:47] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209368: LOOKUP() / => -1 (Transport endpoint is not connected)

[..... then the client process reconnects to the other server, which now has the .99 IP ......]

[2010-05-07 09:20:48] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.99:3399, attached to remote volume 'threads1'.
[2010-05-07 09:20:48] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.99:3399, attached to remote volume 'threads1'.

[...... every attempt to use the mountpoint now produces a stale NFS error - in previous versions this did not happen .....]
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209369: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209370: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209371: LOOKUP() / => -1 (Stale NFS file handle)
[2010-05-07 09:20:51] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 209372: LOOKUP() / => -1 (Stale NFS file handle)

[root@node1 ~]# ls /shared_root
ls: /shared_root: Stale NFS file handle

----- "Christopher Hawkins" <chawkins@xxxxxxxxxxx> wrote:
> Hello, I have a problem now that was previously solved. In a simple
> setup with two servers and one client, the client connected to a
> virtual IP that could fail back and forth to whichever server was
> available. This used to work, but I had not tested between 2.09 and
> today... And now, instead of recovering after a brief timeout, the
> client never recovers and reports endless "Stale NFS file handle"
> errors in its log (though there is no NFS involved, just the native
> gluster client).
>
> So I tried the HA translator from testing. This also does not work.
> After I kill the primary server (listed first in the config file),
> an ls of the mount point hangs for a moment and then reports:
>
> [root@server2 glusterfs]# ls /mnt/test
> ls: /mnt/test: Input/output error
>
> Each attempted ls also produces two errors in the client log: a
> "Transport endpoint is not connected" error followed by the
> "Input/output error".
>
> The client log shows this:
>
> [2010-05-07 12:03:44] N [glusterfsd.c:1408:main] glusterfs: Successfully started
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master2_root: Connected to 192.168.1.92:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master2_root: Connected to 192.168.1.92:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:03:44] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
> [2010-05-07 12:03:44] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
>
> [.... here I killed the primary server ....]
>
> [2010-05-07 12:06:17] E [client-protocol.c:415:client_ping_timer_expired] master_root: Server 192.168.1.91:3399 has not responded in the last 42 seconds, disconnecting.
> [2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: forced unwinding frame type(1) op(LOOKUP)
> [2010-05-07 12:06:17] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:17] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 10: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:17] E [saved-frames.c:165:saved_frames_unwind] master_root: forced unwinding frame type(2) op(PING)
> [2010-05-07 12:06:17] N [client-protocol.c:6994:notify] master_root: disconnected
> [2010-05-07 12:06:17] E [socket.c:762:socket_connect_finish] master_root: connection to 192.168.1.91:3399 failed (No route to host)
> [2010-05-07 12:06:21] E [socket.c:762:socket_connect_finish] master_root: connection to 192.168.1.91:3399 failed (No route to host)
> [2010-05-07 12:06:21] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:21] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 11: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:24] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:24] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 12: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:26] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:26] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 13: LOOKUP() / => -1 (Input/output error)
> [2010-05-07 12:06:39] E [ha.c:125:ha_lookup_cbk] ha: (child=master_root) (op_ret=-1 op_errno=Transport endpoint is not connected)
> [2010-05-07 12:06:39] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 14: LOOKUP() / => -1 (Input/output error)
>
> [.... here I powered the primary server back on ....]
>
> [2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
> [2010-05-07 12:07:07] N [client-protocol.c:6246:client_setvolume_cbk] master_root: Connected to 192.168.1.91:3399, attached to remote volume 'threads1'.
> --------- end log ------------
>
> After it came back, the client recovered and everything picked back
> up. But it seems I cannot get the client to consider any server other
> than the first one it connects to. I assume that if failing the
> primary server's IP address over to another box doesn't work, then
> round-robin DNS will also not work, since the two are essentially the
> same method (a different server answering at the same address). And
> since this used to work, it seems to be an unintended result.
>
> The server vol file has a single export and io-threads, and the
> client has just the two remote-subvolumes and the ha declaration,
> like so:
>
> volume ha
>   type cluster/ha
>   subvolumes master_root master2_root
> end-volume
>
> The code base is GlusterFS version 3.04, compiled from source this
> morning. How can I troubleshoot?
>
> Christopher Hawkins
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
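
For anyone wanting to reproduce the HA-translator test, here is a sketch of what the full client vol file implied by the excerpts above would look like. Only the ha stanza was quoted verbatim; the two protocol/client stanzas are reconstructed by analogy with the master_root stanza from the first log, with the 192.168.1.91/.92 addresses taken from the second log. This is an assumption-filled reconstruction, not the exact file.

```
### Reconstructed HA test client vol file (a sketch, not the original)
volume master_root
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.1.91
  option transport.socket.nodelay on
  # The HA-test log shows "has not responded in the last 42 seconds",
  # so ping-timeout was presumably left at its default (42s) here,
  # unlike the 5s used in the virtual-IP test.
  option remote-port 3399
  option remote-subvolume threads1
end-volume

volume master2_root
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.1.92
  option transport.socket.nodelay on
  option remote-port 3399
  option remote-subvolume threads1
end-volume

### This stanza is quoted verbatim from the mail above
volume ha
  type cluster/ha
  subvolumes master_root master2_root
end-volume
```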