When I set up a minimal ramdisk environment with no fancy directory
re-mapping going on, the failover works. I had to work on other things for a
few days and have not resolved it 100% yet, but it appears that my setup is
creating the issue, not glusterfs. But thank you for the follow-up! If I
suspect a gluster issue I will reopen the thread.

Chris

> -----Original Message-----
> From: krishna.zresearch@xxxxxxxxx [mailto:krishna.zresearch@xxxxxxxxx]
> On Behalf Of Krishna Srinivas
> Sent: Thursday, May 08, 2008 6:35 AM
> To: Anand Avati
> Cc: Christopher Hawkins; gluster-devel@xxxxxxxxxx
> Subject: Re: AFR: machine crash hangs other mounts or transport
> endpoint not connected
>
> Chris,
> Do you see any clues in the log files?
> Krishna
>
> On Wed, Apr 30, 2008 at 8:22 PM, Anand Avati <avati@xxxxxxxxxxxxx> wrote:
> > Chris,
> > can you get the glusterfs client logs from your ramdisk from when the
> > servers are pulled out and the mount point is accessed?
> >
> > avati
> >
> > 2008/4/30 Christopher Hawkins <chawkins@xxxxxxxxxxxxxxxxxxxx>:
> > > Without. All that is removed...
> > >
> > > _____
> > >
> > > From: anand.avati@xxxxxxxxx [mailto:anand.avati@xxxxxxxxx]
> > > On Behalf Of Anand Avati
> > > Sent: Wednesday, April 30, 2008 10:24 AM
> > > To: Christopher Hawkins
> > > Cc: gluster-devel@xxxxxxxxxx
> > > Subject: Re: AFR: machine crash hangs other mounts or transport
> > > endpoint not connected
> > >
> > > Chris,
> > > is this hang with IP failover in place or without?
> > >
> > > avati
> > >
> > > 2008/4/30 Christopher Hawkins <chawkins@xxxxxxxxxxxxxxxxxxxx>:
> > >
> > > Gluster devs,
> > >
> > > I am still not able to keep the client from hanging in a diskless
> > > cluster node. When I fail a server, the client becomes unresponsive
> > > and does not read from the other AFR volume. I first moved the
> > > entire /lib, /bin, and /sbin directories into the ramdisk which runs
> > > the nodes, to rule out the simple loss of an odd binary or library...
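For reference, a sketch of how such a ramdisk root might be populated. The
glusterfs binary's ldd output covers the shared libraries, but the xlator
and transport modules are dlopen()ed at runtime and will not appear in it;
the 1.3.8pre6 module path is taken from the lsof output below, while
$RAMDISK and the copy strategy are assumptions:

    # Sketch only: stage the glusterfs client and everything it loads
    # into an assumed ramdisk root at $RAMDISK.
    RAMDISK=/mnt/ramdisk
    mkdir -p "$RAMDISK/bin"
    cp /bin/glusterfs "$RAMDISK/bin/"

    # Libraries the dynamic linker resolves, preserving directory layout
    # (e.g. /lib/tls/libc-2.3.4.so ends up under $RAMDISK/lib/tls/).
    for lib in $(ldd /bin/glusterfs | awk '/\// { print $(NF-1) }'); do
        cp --parents "$lib" "$RAMDISK"
    done

    # Xlators and transports are dlopen()ed from the volume spec, so ldd
    # never lists them; copy the whole module tree explicitly.
    cp -r --parents /lib/glusterfs/1.3.8pre6 "$RAMDISK"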
> > > An lsof | grep gluster on the client (pre-failover test) shows:
> > >
> > > [root@node1 ~]# lsof | grep gluster
> > > glusterfs 2195 root  cwd  DIR     0,1       0     2 /
> > > glusterfs 2195 root  rtd  DIR     0,1       0     2 /
> > > glusterfs 2195 root  txt  REG     0,1   55592  3863 /bin/glusterfs
> > > glusterfs 2195 root  mem  REG     0,1  341068  2392 /lib/libfuse.so.2.7.2
> > > glusterfs 2195 root  mem  REG     0,1  118096  2505 /lib/glusterfs/1.3.8pre6/xlator/mount/fuse.so
> > > glusterfs 2195 root  mem  REG     0,1  164703  2514 /lib/glusterfs/1.3.8pre6/xlator/protocol/client.so
> > > glusterfs 2195 root  mem  REG     0,1  112168    77 /lib/ld-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1 1529120  2483 /lib/tls/libc-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1   16732    70 /lib/libdl-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1  107800  2485 /lib/tls/libpthread-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1   43645  2533 /lib/glusterfs/1.3.8pre6/transport/tcp/client.so
> > > glusterfs 2195 root  mem  REG     0,1  427763  2456 /lib/libglusterfs.so.0.0.0
> > > glusterfs 2195 root  mem  REG     0,1   50672  2474 /lib/tls/librt-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1  245686  2522 /lib/glusterfs/1.3.8pre6/xlator/cluster/afr.so
> > > glusterfs 2195 root  0u   CHR     1,3          3393 /dev/null
> > > glusterfs 2195 root  1u   CHR     1,3          3393 /dev/null
> > > glusterfs 2195 root  2u   CHR     1,3          3393 /dev/null
> > > glusterfs 2195 root  3w   REG     0,1     102  4495 /var/log/glusterfs/glusterfs.log
> > > glusterfs 2195 root  4u   CHR  10,229          3494 /dev/fuse
> > > glusterfs 2195 root  5r   0000    0,8       0  4498 eventpoll
> > > glusterfs 2195 root  6u   IPv4   4499           TCP 192.168.20.155:1023->master1:6996 (ESTABLISHED)
> > > glusterfs 2195 root  7u   IPv4   4500           TCP 192.168.20.155:1022->master2:6996 (ESTABLISHED)
> > >
> > > Everything listed here is a local file, and the gluster binary has
> > > access to all of them during failover. Can you help me troubleshoot
> > > by explaining what exactly gluster is doing when it loses a
> > > connection? Does it depend on something I have missed? This failover
> > > test uses the same config files and binaries as my earlier tests
> > > (which succeeded, but were not run on a diskless node). There must
> > > be something else in the filesystem that glusterfs requires to fail
> > > over successfully?
> > >
> > > Thanks,
> > > Chris
> > >
> > > > Gerry, Christopher,
> > > >
> > > > Here is what I tried to do. Two servers, one client, simple setup,
> > > > afr on the client side. I did "ls" on the client mount point, and
> > > > it works. Now I do "ifconfig eth0 down" on the server; next I do
> > > > "ls" on the client, and it hangs for 10 secs (the timeout value),
> > > > then fails over and starts working again without any problem.
> > > >
> > > > I guess only a few users are facing the problem you guys are
> > > > facing. Can you give us your setup details and mention the exact
> > > > steps to reproduce? Also try to come up with a minimal config
> > > > which can still reproduce the problem.
> > > >
> > > > Thanks!
> > > > Krishna
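For reference, a minimal client-side volume spec of the kind Krishna
describes (two protocol/client volumes under cluster/afr, with the
10-second transport timeout) could be written out as in this sketch; the
spec path and the remote-subvolume name "brick" are assumptions, while the
hosts and port 6996 match the established connections in the lsof output
above:

    cat > /etc/glusterfs/glusterfs-client.vol <<'EOF'
    # Sketch: two bricks replicated by client-side AFR. "brick" is an
    # assumed remote-subvolume name; master1/master2:6996 come from the
    # lsof output earlier in this thread.
    volume client1
      type protocol/client
      option transport-type tcp/client
      option remote-host master1
      option remote-port 6996
      option remote-subvolume brick
      # seconds a call waits on a dead server before failing over
      option transport-timeout 10
    end-volume

    volume client2
      type protocol/client
      option transport-type tcp/client
      option remote-host master2
      option remote-port 6996
      option remote-subvolume brick
      option transport-timeout 10
    end-volume

    volume afr
      type cluster/afr
      subvolumes client1 client2
    end-volume
    EOF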
> > > > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins
> > > > <chawkins@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > I am having the same issue. I'm working on a diskless node
> > > > > cluster, and I figured the issue was related to that, since AFR
> > > > > seems to fail over nicely for everyone else... But it seems I am
> > > > > not alone, so what can I do to help troubleshoot?
> > > > >
> > > > > I have two servers exporting a brick each, and a client mounting
> > > > > them both with AFR and no unify. Transport timeout settings
> > > > > don't seem to make a difference - the client is just hung if I
> > > > > power off or just stop glusterfsd. There is nothing logged on
> > > > > the server side. I'll use a USB thumb drive for client-side
> > > > > logging, since any logs in the ramdisk obviously disappear after
> > > > > the reboot which fixes the hang... If I get any insight from
> > > > > this I'll report it asap.
> > > > >
> > > > > Thanks,
> > > > > Chris
> >
> > --
> > If I traveled to the end of the rainbow
> > As Dame Fortune did intend,
> > Murphy would be there to tell me
> > The pot's at the other end.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel
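A sketch of the USB-stick logging Chris describes, so the client log
survives the reboot that clears the ramdisk. The device node and paths are
assumptions; -f, -l, and -L are the 1.3-era spec-file, log-file, and
log-level options of the glusterfs client:

    # Sketch: mount the thumb drive (assumed /dev/sda1) and point the
    # client's log file at it before running the failover test.
    mkdir -p /mnt/usb /mnt/glusterfs
    mount /dev/sda1 /mnt/usb
    glusterfs -f /etc/glusterfs/glusterfs-client.vol \
              -l /mnt/usb/glusterfs-client.log \
              -L DEBUG \
              /mnt/glusterfs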