Oops - there is a "brick ns" in my server configs that I should have
removed - will do so now and verify results are the same.

> Thanks Krishna! I moved the setup out of the diskless boot cluster
> and am able to reproduce on regular machines. All version and config
> information is below. The scenario is:
>
> 1. Start glusterfsd on both servers
> 2. On client, mount gluster at /mnt/gluster
> 3. Run little testing script on client to show status of the mount:
>
> #!/bin/bash
> while true
> do
>   echo $(date)
>   sleep 1
>   cat /mnt/gluster/etc/fstab
> done
>
> 4-A. ifconfig down on server1 - client logs no errors, no delays
>      (must be reading from server2)
> 4-B. ifconfig down on server2 - first time I tried = recovery in 5 seconds
> 4-B.  "        "              - 2nd and 3rd times = client hangs until I
>      manually kill the process
> 4-C. Hard power off on server1 - client logs no errors, no delays
> 4-D. Hard power off on server2, client hangs until I manually kill
>      the process
>
> The client logs the following in situation 4-B during a hang:
>
> 2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master1: activating bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received = 2008-04-29 08:36:40 transport-timeout = 5
> 2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master1: bailing transport
> 2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master2: activating bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received = 2008-04-29 08:36:40 transport-timeout = 5
> 2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master2: bailing transport
> 2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup] master2: cleaning up state in transport object 0x8e11968
> 2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup] master2: forced unwinding frame type(1) op(34) reply=@0x8e4a8e0
> 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no proper reply from server, returning ENOTCONN
> 2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup] master1: cleaning up state in transport object 0x8e103b8
> 2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup] master1: forced unwinding frame type(1) op(34) reply=@0x8e4a9b0
> 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master1: no proper reply from server, returning ENOTCONN
> 2008-04-29 08:36:50 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 362: (34) /etc => -1 (107)
> 2008-04-29 08:36:50 E [client-protocol.c:324:client_protocol_xfer] master1: transport_submit failed
> 2008-04-29 08:36:50 W [client-protocol.c:331:client_protocol_xfer] master2: not connected at the moment to submit frame type(1) op(34)
> 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no proper reply from server, returning ENOTCONN
> 2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master2: non-blocking connect() returned: 113 (No route to host)
> 2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master1: non-blocking connect() returned: 113 (No route to host)
> 2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master2: non-blocking connect() returned: 113 (No route to host)
> 2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master1: non-blocking connect() returned: 113 (No route to host)
>
> Version:
> glusterfs 1.3.8pre6 built on Apr 28 2008 21:20:10
> Repository revision: glusterfs--mainline--2.5--patch-748
>
> Config file on server1 and server2:
> -------------------------
> volume storage1
>   type storage/posix                          # POSIX FS translator
>   option directory /                          # Export this directory
> end-volume
> #
> volume brick-ns
>   type storage/posix
>   option directory /ns
> end-volume
> #
> volume server
>   type protocol/server
>   option transport-type tcp/server            # For TCP/IP transport
>   subvolumes storage1
>   option auth.ip.storage1.allow 192.168.20.*  # Allow access to "storage1" volume
> end-volume
> -------------------------
>
> Config file on client:
> -------------------------
> volume master1
>   type protocol/client
>   option transport-type tcp/client     # for TCP/IP transport
>   option remote-host 192.168.20.140    # IP address of the remote brick
>   option transport-timeout 5
>   option remote-subvolume storage1     # name of the remote volume
> end-volume
>
> volume master2
>   type protocol/client
>   option transport-type tcp/client     # for TCP/IP transport
>   option remote-host 192.168.20.141    # IP address of the remote brick
>   option transport-timeout 5
>   option remote-subvolume storage1     # name of the remote volume
> end-volume
>
> volume data-afr
>   type cluster/afr
>   subvolumes master1 master2
> end-volume
> -------------------------
>
> > Gerry, Christopher,
> >
> > Here is what I tried to do. Two servers, one client, simple setup,
> > afr on the client side. I did "ls" on client mount point, it works,
> > now I do "ifconfig eth0 down" on the server, next I do "ls" on
> > client, it hangs for 10 secs (timeout value) and fails over and
> > starts working again without any problem.
> >
> > I guess few users are facing the problem you guys are facing.
> > Can you give us your setup details and mention the exact steps to
> > reproduce. Also try to come up with minimal config details which
> > can still reproduce the problem
> >
> > Thanks!
> > Krishna
> >
> > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins
> > <chawkins@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > I am having the same issue. I'm working on a diskless node
> > > cluster and figured the issue was related to that since AFR
> > > seems to fail over nicely for everyone else...
> > > But it seems I am not alone, so what can I do to help
> > > troubleshoot?
> > >
> > > I have two servers exporting a brick each, and a client mounting
> > > them both with AFR and no unify. Transport timeout settings don't
> > > seem to make a difference - client is just hung if I power off or
> > > just stop glusterfsd. There is nothing logged on the server side.
> > > I'll use a usb thumb drive for client side logging since any logs
> > > in the ramdisk obviously disappear after the reboot which fixes
> > > the hang...
> > > If I get any insight from this I'll report it asap.
> > >
> > > Thanks,
> > > Chris
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
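
The test loop in step 3 blocks inside cat when the mount hangs, so each hang has to be ended by killing the process by hand. A minimal sketch of a bounded variant follows. It is not part of the original setup: it assumes GNU coreutils `timeout` is available on the client, and the `probe` function name and the 10-second default limit are made up for illustration.

```shell
#!/bin/bash
# Sketch only: bounded variant of the step-3 test loop.
# Assumes GNU coreutils `timeout`; `probe` and the 10 s default are hypothetical.

# probe FILE [LIMIT]: try to read FILE within LIMIT seconds.
# Prints one timestamped status line; returns 0 only if the read succeeded.
probe() {
    local file=$1 limit=${2:-10} rc
    # Send SIGTERM after LIMIT seconds, then SIGKILL 1 s later if cat is still alive.
    timeout -k 1 "$limit" cat -- "$file" > /dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "$(date '+%F %T') OK $file"
    elif [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        # 124 is timeout's "timed out" status; 137 (128+9) means SIGKILL was needed.
        echo "$(date '+%F %T') HANG $file (no reply within ${limit}s)"
    else
        echo "$(date '+%F %T') FAIL $file (cat exit status $rc)"
    fi
    return "$rc"
}

# Loop only when a file is given on the command line, for example:
#   ./probe.sh /mnt/gluster/etc/fstab 10
if [ "$#" -gt 0 ]; then
    while true
    do
        probe "$1" "${2:-10}"
        sleep 1
    done
fi
```

Run against /mnt/gluster/etc/fstab during scenarios 4-B and 4-D, this should print a HANG line for each second the mount stays stuck instead of stalling, which gives a timestamp to line up against the client log. Whether a reader stuck in the FUSE mount actually exits on SIGTERM or SIGKILL is untested here.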