RE: AFR: machine crash hangs other mounts or transport endpoint not connected

"Christopher Hawkins" <chawkins@xxxxxxxxxxxxxxxxxxxx> · Tue, 29 Apr 2008 08:51:03 -0400



Oops - there is a "brick-ns" volume in my server configs that I should have
removed - will do so now and verify that the results are the same.
 
> Thanks Krishna!  I moved the setup out of the diskless boot 
> cluster and am able to reproduce on regular machines. All 
> version and config information is below. The scenario is:
> 
> 1. Start glusterfsd on both servers
> 2. On client, mount gluster at /mnt/gluster
> 3. Run little testing script on client to show status of the mount:
> #!/bin/bash
> while true
>  do
>    echo $(date)
>    sleep 1
>    cat /mnt/gluster/etc/fstab
>  done
> 4-A. ifconfig down on server1 - client logs no errors, no delays
>      (must be reading from server2)
> 4-B. ifconfig down on server2 - first time I tried = recovery in 5 seconds
> 4-B. "                      " - 2nd and 3rd times = client hangs until I
>      manually kill the process
> 4-C. Hard power off on server1 - client logs no errors, no delays
> 4-D. Hard power off on server2, client hangs until I manually kill the process
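
A possible variant of the test script in step 3, sketched under the
assumption that a coreutils with timeout(1) is installed on the client.
Each read gets a time limit, so a hung mount shows up as a "hang" line
instead of blocking the loop. The probe function name, the "run" argument
and the 10-second limit are illustrative and not part of the original test.
Whether a stuck read can actually be killed depends on the kernel and FUSE
state, so treat the output as a hint only.

```shell
#!/bin/bash
# probe LIMIT CMD...: run CMD with a time limit and print a one-word status.
# Assumption: GNU timeout(1) is available; it sends TERM after LIMIT seconds
# and KILL two seconds later (-k 2) if the command is still alive.
probe() {
    probe_limit=$1
    shift
    probe_rc=0
    timeout -k 2 "$probe_limit" "$@" > /dev/null 2>&1 || probe_rc=$?
    case $probe_rc in
        0)       echo "ok" ;;
        124|137) echo "hang" ;;            # 124 = timed out, 137 = needed KILL
        *)       echo "error rc=$probe_rc" ;;  # e.g. cat failing with ENOTCONN
    esac
}

# Start the loop with:  ./probe.sh run   (the script name is hypothetical)
if [ "${1:-}" = "run" ]; then
    while true; do
        echo "$(date) $(probe 10 cat /mnt/gluster/etc/fstab)"
        sleep 1
    done
fi
```

With this, cases 4-B and 4-D would presumably print "hang" once per
iteration rather than stopping the output, which makes it easier to see
whether the mount ever recovers.
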
> 
> The client logs the following in situation 4-B during a hang:
> 
> 2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master1: activating bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received = 2008-04-29 08:36:40 transport-timeout = 5
> 2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master1: bailing transport
> 2008-04-29 08:36:50 W [client-protocol.c:204:call_bail] master2: activating bail-out. pending frames = 1. last sent = 2008-04-29 08:36:41. last received = 2008-04-29 08:36:40 transport-timeout = 5
> 2008-04-29 08:36:50 C [client-protocol.c:211:call_bail] master2: bailing transport
> 2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup] master2: cleaning up state in transport object 0x8e11968
> 2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup] master2: forced unwinding frame type(1) op(34) reply=@0x8e4a8e0
> 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no proper reply from server, returning ENOTCONN
> 2008-04-29 08:36:50 W [client-protocol.c:4759:client_protocol_cleanup] master1: cleaning up state in transport object 0x8e103b8
> 2008-04-29 08:36:50 E [client-protocol.c:4809:client_protocol_cleanup] master1: forced unwinding frame type(1) op(34) reply=@0x8e4a9b0
> 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master1: no proper reply from server, returning ENOTCONN
> 2008-04-29 08:36:50 E [fuse-bridge.c:459:fuse_entry_cbk] glusterfs-fuse: 362: (34) /etc => -1 (107)
> 2008-04-29 08:36:50 E [client-protocol.c:324:client_protocol_xfer] master1: transport_submit failed
> 2008-04-29 08:36:50 W [client-protocol.c:331:client_protocol_xfer] master2: not connected at the moment to submit frame type(1) op(34)
> 2008-04-29 08:36:50 E [client-protocol.c:4405:client_lookup_cbk] master2: no proper reply from server, returning ENOTCONN
> 2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master2: non-blocking connect() returned: 113 (No route to host)
> 2008-04-29 08:37:43 E [tcp-client.c:190:tcp_connect] master1: non-blocking connect() returned: 113 (No route to host)
> 2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master2: non-blocking connect() returned: 113 (No route to host)
> 2008-04-29 08:39:13 E [tcp-client.c:190:tcp_connect] master1: non-blocking connect() returned: 113 (No route to host)
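
To compare runs (for example the first 4-B attempt against the later
ones), the "bailing transport" events could be counted per client volume.
This is only a sketch: it assumes the log file itself holds one event per
line with the field layout "date time level [source] volume: message", and
count_bails is a made-up helper name.

```shell
#!/bin/bash
# count_bails [LOGFILE...]: print "<volume> <count>" for every client volume
# that logged a "bailing transport" event. Reads stdin when no file is given.
count_bails() {
    awk '/call_bail\]/ && /bailing transport/ {
             vol = $5
             sub(/:$/, "", vol)     # strip the trailing colon from the volume name
             n[vol]++
         }
         END { for (v in n) print v, n[v] }' "$@" | sort
}
```

It would be called as "count_bails /path/to/client.log", where the path is
whatever log file the client was started with.
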
> 
> Version:
> glusterfs 1.3.8pre6 built on Apr 28 2008 21:20:10
> Repository revision: glusterfs--mainline--2.5--patch-748
> 
> Config file on server1 and server2:
> -------------------------
> volume storage1
>   type storage/posix                   # POSIX FS translator
>   option directory /        # Export this directory
> end-volume
> #
> volume brick-ns
>   type storage/posix
>   option directory /ns
> end-volume
> #
> volume server
>   type protocol/server
>   option transport-type tcp/server     # For TCP/IP transport
>   subvolumes storage1
>   option auth.ip.storage1.allow 192.168.20.* # Allow access to "storage1" volume
> end-volume
> -------------------------
> 
> Config file on client:
> -------------------------
> volume master1
>   type protocol/client
>   option transport-type tcp/client     # for TCP/IP transport
>   option remote-host 192.168.20.140       # IP address of the remote brick
>   option transport-timeout 5
>   option remote-subvolume storage1      # name of the remote volume
> end-volume
> 
> volume master2
>   type protocol/client
>   option transport-type tcp/client     # for TCP/IP transport
>   option remote-host 192.168.20.141       # IP address of the remote brick
>   option transport-timeout 5
>   option remote-subvolume storage1        # name of the remote volume
> end-volume
> 
> volume data-afr
>   type cluster/afr
>   subvolumes master1 master2
> end-volume
> -------------------------
> 
>  
> > Gerry, Christopher,
> > 
> > Here is what I tried to do. Two servers, one client, simple setup, afr
> > on the client side. I did "ls" on client mount point, it works, now I
> > do "ifconfig eth0 down" on the server, next I do "ls" on client, it
> > hangs for 10 secs (timeout value) and fails over and starts working
> > again without any problem.
> > 
> > I guess few users are facing the problem you guys are facing. 
> > Can you give us your setup details and mention the exact steps to 
> > reproduce. Also try to come up with minimal config details which can
> > still reproduce the problem
> > 
> > Thanks!
> > Krishna
> > 
> > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins 
> > <chawkins@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > I am having the same issue. I'm working on a diskless node cluster
> > > and figured the issue was related to that since AFR seems to fail
> > > over nicely for everyone else...
> > >  But it seems I am not alone, so what can I do to help troubleshoot?
> > >
> > >  I have two servers exporting a brick each, and a client mounting 
> > > them both with AFR and no unify. Transport timeout settings don't
> > > seem to make a difference - client is just hung if I power off or
> > > just stop glusterfsd. There is nothing logged on the server side.
> > >  I'll use a usb thumb drive for client side logging since any logs in
> > > the ramdisk obviously disappear after the reboot which fixes the hang...
> > >  If I get any insight from this I'll report it asap.
> > >
> > >  Thanks,
> > >  Chris
> > >
> > >
> > >
> 
> 
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>