Stop the press - it was my error. I had some IP manipulation going on in the background that I had forgotten about. When I removed that and went to a normal setup, everything works after 5 seconds. "ifconfig eth0 down" and a hard power off both show the same result on either server - a 5 second delay and then back in business. :-)

I had IPs failing over with heartbeat. Apparently when the failed IP from server2 shows up on server1, glusterfs doesn't like it very much and it hangs rather than reconnecting. Is this the expected behavior? Since the server is more or less stateless, I would think that this would not cause a hang on the client.
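
For anyone who wants to see the failure mode without setting up heartbeat, the same situation can probably be created by hand: take server2 down, then bring its brick address up on server1 as an alias, the way heartbeat's IP takeover does. This is only a sketch of what I believe was happening (I haven't re-run it in exactly this form), using the 192.168.20.141 address from the client config below:

  # On server2: simulate the failure that triggers the IP takeover.
  ifconfig eth0 down

  # On server1: bring up server2's brick address as an alias, the way
  # heartbeat's IP takeover did in my broken setup.
  ifconfig eth0:0 192.168.20.141 netmask 255.255.255.0 up

  # On the client: instead of recovering after the 5 second
  # transport-timeout, this read hangs until the process is killed.
  cat /mnt/gluster/etc/fstab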

> -----Original Message-----
> From: gluster-devel-bounces+chawkins=veracitynetworks.com@xxxxxxxxxx
> [mailto:gluster-devel-bounces+chawkins=veracitynetworks.com@nongnu.org]
> On Behalf Of Christopher Hawkins
> Sent: Tuesday, April 29, 2008 8:51 AM
> To: gluster-devel@xxxxxxxxxx
> Subject: RE: AFR: machine crash hangs other mounts or transport endpoint
> not connected
>
> Oops - there is a "brick ns" in my server configs that I should have
> removed - will do so now and verify results are the same.
>
> > Thanks Krishna! I moved the setup out of the diskless boot cluster
> > and am able to reproduce on regular machines. All version and config
> > information is below. The scenario is:
> >
> > 1. Start glusterfsd on both servers
> > 2. On client, mount gluster at /mnt/gluster
> > 3. Run little testing script on client to show status of the mount:
> >    #!/bin/bash
> >    while true
> >    do
> >    echo $(date)
> >    sleep 1
> >    cat /mnt/gluster/etc/fstab
> >    done
> > 4-A. ifconfig down on server1 - client logs no errors, no delays (must
> >      be reading from server2)
> > 4-B. ifconfig down on server2 - first time I tried = recovery in 5 seconds
> > 4-B. ifconfig down on server2 - 2nd and 3rd times = client hangs until I
> >      manually kill the process
> > 4-C. Hard power off on server1 - client logs no errors, no delays
> > 4-D. Hard power off on server2 - client hangs until I manually kill the
> >      process
> >
> > The client logs the following in situation 4-B during a hang:
> >
> > 2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master1: activating bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received = 2008-04-29 08:36:40 transport-timeout = 5
> > 2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master1: bailing transport
> > 2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master2: activating bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received = 2008-04-29 08:36:40 transport-timeout = 5
> > 2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master2: bailing transport
> > 2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup] master2: cleaning up state in transport object 0x8e11968
> > 2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup] master2: forced unwinding frame type(1) op(34) reply=@0x8e4a8e0
> > 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no proper reply from server, returning ENOTCONN
> > 2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup] master1: cleaning up state in transport object 0x8e103b8
> > 2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup] master1: forced unwinding frame type(1) op(34) reply=@0x8e4a9b0
> > 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master1: no proper reply from server, returning ENOTCONN
> > 2008-04-29 08:36:50 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 362: (34) /etc => -1 (107)
> > 2008-04-29 08:36:50 E [client-protocol.c:324:client_protocol_xfer] master1: transport_submit failed
> > 2008-04-29 08:36:50 W [client-protocol.c:331:client_protocol_xfer] master2: not connected at the moment to submit frame type(1) op(34)
> > 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no proper reply from server, returning ENOTCONN
> > 2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master2: non-blocking connect() returned: 113 (No route to host)
> > 2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master1: non-blocking connect() returned: 113 (No route to host)
> > 2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master2: non-blocking connect() returned: 113 (No route to host)
> > 2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master1: non-blocking connect() returned: 113 (No route to host)
> >
> > Version:
> > glusterfs 1.3.8pre6 built on Apr 28 2008 21:20:10
> > Repository revision: glusterfs--mainline--2.5--patch-748
> >
> > Config file on server1 and server2:
> > -------------------------
> > volume storage1
> >   type storage/posix                           # POSIX FS translator
> >   option directory /                           # Export this directory
> > end-volume
> > #
> > volume brick-ns
> >   type storage/posix
> >   option directory /ns
> > end-volume
> > #
> > volume server
> >   type protocol/server
> >   option transport-type tcp/server             # For TCP/IP transport
> >   subvolumes storage1
> >   option auth.ip.storage1.allow 192.168.20.*   # Allow access to "storage1" volume
> > end-volume
> > -------------------------
> >
> > Config file on client:
> > -------------------------
> > volume master1
> >   type protocol/client
> >   option transport-type tcp/client     # for TCP/IP transport
> >   option remote-host 192.168.20.140    # IP address of the remote brick
> >   option transport-timeout 5
> >   option remote-subvolume storage1     # name of the remote volume
> > end-volume
> >
> > volume master2
> >   type protocol/client
> >   option transport-type tcp/client     # for TCP/IP transport
> >   option remote-host 192.168.20.141    # IP address of the remote brick
> >   option transport-timeout 5
> >   option remote-subvolume storage1     # name of the remote volume
> > end-volume
> >
> > volume data-afr
> >   type cluster/afr
> >   subvolumes master1 master2
> > end-volume
> > -------------------------
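
(As an aside, the test loop quoted above can be extended to print how long each read takes, which makes a 5-second failover delay easy to tell apart from a real hang. This is an untested sketch along the same lines as the original:)

  #!/bin/bash
  # Same idea as the loop above, but time each read so a short
  # failover delay stands out from a hang.
  while true
  do
      start=$(date +%s)
      cat /mnt/gluster/etc/fstab > /dev/null
      end=$(date +%s)
      echo "$(date): read took $((end - start)) second(s)"
      sleep 1
  done
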
> > > Gerry, Christopher,
> > >
> > > Here is what I tried to do. Two servers, one client, simple setup, afr
> > > on the client side. I did "ls" on the client mount point and it works.
> > > Now I do "ifconfig eth0 down" on the server, then "ls" on the client
> > > again: it hangs for 10 secs (the timeout value), fails over, and starts
> > > working again without any problem.
> > >
> > > I guess a few users are facing the problem you guys are facing.
> > > Can you give us your setup details and mention the exact steps to
> > > reproduce? Also try to come up with minimal config details which can
> > > still reproduce the problem.
> > >
> > > Thanks!
> > > Krishna
> > >
> > > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins
> > > <chawkins@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > I am having the same issue. I'm working on a diskless node cluster
> > > > and figured the issue was related to that, since AFR seems to fail
> > > > over nicely for everyone else... But it seems I am not alone, so
> > > > what can I do to help troubleshoot?
> > > >
> > > > I have two servers exporting a brick each, and a client mounting
> > > > them both with AFR and no unify. Transport timeout settings don't
> > > > seem to make a difference - the client is just hung if I power off
> > > > or just stop glusterfsd. There is nothing logged on the server side.
> > > > I'll use a USB thumb drive for client-side logging, since any logs
> > > > in the ramdisk obviously disappear after the reboot which fixes the
> > > > hang... If I get any insight from this I'll report it asap.
> > > >
> > > > Thanks,
> > > > Chris
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel