When I set up a minimal ramdisk environment with no fancy directory
re-mapping going on, the failover works. I had to work on other things for a
few days and have not resolved it 100% yet, but it appears that my setup is
creating the issue, not glusterfs. But thank you for the follow-up! If I
suspect a gluster issue I will reopen the thread.

Chris

> -----Original Message-----
> From: krishna.zresearch@xxxxxxxxx [mailto:krishna.zresearch@xxxxxxxxx]
> On Behalf Of Krishna Srinivas
> Sent: Thursday, May 08, 2008 6:35 AM
> To: Anand Avati
> Cc: Christopher Hawkins; gluster-devel@xxxxxxxxxx
> Subject: Re: AFR: machine crash hangs other mounts or transport
> endpoint not connected
>
> Chris,
> Do you see any clues in the log files?
> Krishna
>
> On Wed, Apr 30, 2008 at 8:22 PM, Anand Avati <avati@xxxxxxxxxxxxx> wrote:
> > Chris,
> > can you get the glusterfs client logs from your ramdisk from when the
> > servers are pulled out and the mount point is accessed?
> >
> > avati
> >
> > 2008/4/30 Christopher Hawkins <chawkins@xxxxxxxxxxxxxxxxxxxx>:
> > > Without. All that is removed...
> > >
> > > _____
> > >
> > > From: anand.avati@xxxxxxxxx [mailto:anand.avati@xxxxxxxxx]
> > > On Behalf Of Anand Avati
> > > Sent: Wednesday, April 30, 2008 10:24 AM
> > > To: Christopher Hawkins
> > > Cc: gluster-devel@xxxxxxxxxx
> > > Subject: Re: AFR: machine crash hangs other mounts or transport
> > > endpoint not connected
> > >
> > > Chris,
> > > is this hang with IP failover in place or without?
> > >
> > > avati
> > >
> > > 2008/4/30 Christopher Hawkins <chawkins@xxxxxxxxxxxxxxxxxxxx>:
> > >
> > > Gluster devs,
> > >
> > > I am still not able to keep the client from hanging in a diskless
> > > cluster node. When I fail a server, the client becomes unresponsive
> > > and does not read from the other AFR volume. I first moved the
> > > entire /lib, /bin, and /sbin directories into the ramdisk which runs
> > > the nodes, to rule out the simple loss of an odd binary or library...
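For reference, a sketch of how such a ramdisk root might be populated. The
glusterfs binary's ldd output covers the shared libraries, but the xlator
and transport modules are dlopen()ed at runtime and will not appear in it;
the 1.3.8pre6 module path is taken from the lsof output below, while
$RAMDISK and the copy strategy are assumptions:

    # Sketch only: stage the glusterfs client and everything it loads
    # into an assumed ramdisk root at $RAMDISK.
    RAMDISK=/mnt/ramdisk
    mkdir -p "$RAMDISK/bin"
    cp /bin/glusterfs "$RAMDISK/bin/"

    # Libraries the dynamic linker resolves, preserving directory layout
    # (e.g. /lib/tls/libc-2.3.4.so ends up under $RAMDISK/lib/tls/).
    for lib in $(ldd /bin/glusterfs | awk '/\// { print $(NF-1) }'); do
        cp --parents "$lib" "$RAMDISK"
    done

    # Xlators and transports are dlopen()ed from the volume spec, so ldd
    # never lists them; copy the whole module tree explicitly.
    cp -r --parents /lib/glusterfs/1.3.8pre6 "$RAMDISK"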
> > > An lsof | grep gluster on the client (pre-failover test) shows:
> > >
> > > [root@node1 ~]# lsof | grep gluster
> > > glusterfs 2195 root  cwd  DIR     0,1       0     2 /
> > > glusterfs 2195 root  rtd  DIR     0,1       0     2 /
> > > glusterfs 2195 root  txt  REG     0,1   55592  3863 /bin/glusterfs
> > > glusterfs 2195 root  mem  REG     0,1  341068  2392 /lib/libfuse.so.2.7.2
> > > glusterfs 2195 root  mem  REG     0,1  118096  2505 /lib/glusterfs/1.3.8pre6/xlator/mount/fuse.so
> > > glusterfs 2195 root  mem  REG     0,1  164703  2514 /lib/glusterfs/1.3.8pre6/xlator/protocol/client.so
> > > glusterfs 2195 root  mem  REG     0,1  112168    77 /lib/ld-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1 1529120  2483 /lib/tls/libc-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1   16732    70 /lib/libdl-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1  107800  2485 /lib/tls/libpthread-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1   43645  2533 /lib/glusterfs/1.3.8pre6/transport/tcp/client.so
> > > glusterfs 2195 root  mem  REG     0,1  427763  2456 /lib/libglusterfs.so.0.0.0
> > > glusterfs 2195 root  mem  REG     0,1   50672  2474 /lib/tls/librt-2.3.4.so
> > > glusterfs 2195 root  mem  REG     0,1  245686  2522 /lib/glusterfs/1.3.8pre6/xlator/cluster/afr.so
> > > glusterfs 2195 root  0u   CHR     1,3          3393 /dev/null
> > > glusterfs 2195 root  1u   CHR     1,3          3393 /dev/null
> > > glusterfs 2195 root  2u   CHR     1,3          3393 /dev/null
> > > glusterfs 2195 root  3w   REG     0,1     102  4495 /var/log/glusterfs/glusterfs.log
> > > glusterfs 2195 root  4u   CHR  10,229          3494 /dev/fuse
> > > glusterfs 2195 root  5r   0000    0,8       0  4498 eventpoll
> > > glusterfs 2195 root  6u   IPv4   4499           TCP 192.168.20.155:1023->master1:6996 (ESTABLISHED)
> > > glusterfs 2195 root  7u   IPv4   4500           TCP 192.168.20.155:1022->master2:6996 (ESTABLISHED)
> > >
> > > Everything listed here is a local file, and the gluster binary has
> > > access to all of them during failover. Can you help me troubleshoot
> > > by explaining what exactly gluster is doing when it loses a
> > > connection? Does it depend on something I have missed? This failover
> > > test uses the same config files and binaries as my earlier tests
> > > (which succeeded, but were not run on a diskless node). There must
> > > be something else in the filesystem that glusterfs requires to fail
> > > over successfully?
> > >
> > > Thanks,
> > > Chris
> > >
> > > > Gerry, Christopher,
> > > >
> > > > Here is what I tried to do. Two servers, one client, simple setup,
> > > > afr on the client side. I did "ls" on the client mount point, and
> > > > it works. Now I do "ifconfig eth0 down" on the server; next I do
> > > > "ls" on the client, and it hangs for 10 secs (the timeout value),
> > > > then fails over and starts working again without any problem.
> > > >
> > > > I guess only a few users are facing the problem you guys are
> > > > facing. Can you give us your setup details and mention the exact
> > > > steps to reproduce? Also try to come up with a minimal config
> > > > which can still reproduce the problem.
> > > >
> > > > Thanks!
> > > > Krishna
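For reference, a minimal client-side volume spec of the kind Krishna
describes (two protocol/client volumes under cluster/afr, with the
10-second transport timeout) could be written out as in this sketch; the
spec path and the remote-subvolume name "brick" are assumptions, while the
hosts and port 6996 match the established connections in the lsof output
above:

    cat > /etc/glusterfs/glusterfs-client.vol <<'EOF'
    # Sketch: two bricks replicated by client-side AFR. "brick" is an
    # assumed remote-subvolume name; master1/master2:6996 come from the
    # lsof output earlier in this thread.
    volume client1
      type protocol/client
      option transport-type tcp/client
      option remote-host master1
      option remote-port 6996
      option remote-subvolume brick
      # seconds a call waits on a dead server before failing over
      option transport-timeout 10
    end-volume

    volume client2
      type protocol/client
      option transport-type tcp/client
      option remote-host master2
      option remote-port 6996
      option remote-subvolume brick
      option transport-timeout 10
    end-volume

    volume afr
      type cluster/afr
      subvolumes client1 client2
    end-volume
    EOF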
> > > > On Sat, Apr 26, 2008 at 7:01 AM, Christopher Hawkins
> > > > <chawkins@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > > > I am having the same issue. I'm working on a diskless node
> > > > > cluster, and I figured the issue was related to that, since AFR
> > > > > seems to fail over nicely for everyone else... But it seems I am
> > > > > not alone, so what can I do to help troubleshoot?
> > > > >
> > > > > I have two servers exporting a brick each, and a client mounting
> > > > > them both with AFR and no unify. Transport timeout settings
> > > > > don't seem to make a difference - the client is just hung if I
> > > > > power off or just stop glusterfsd. There is nothing logged on
> > > > > the server side. I'll use a USB thumb drive for client-side
> > > > > logging, since any logs in the ramdisk obviously disappear after
> > > > > the reboot which fixes the hang... If I get any insight from
> > > > > this I'll report it asap.
> > > > >
> > > > > Thanks,
> > > > > Chris
> >
> > --
> > If I traveled to the end of the rainbow
> > As Dame Fortune did intend,
> > Murphy would be there to tell me
> > The pot's at the other end.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel
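A sketch of the USB-stick logging Chris describes, so the client log
survives the reboot that clears the ramdisk. The device node and paths are
assumptions; -f, -l, and -L are the 1.3-era spec-file, log-file, and
log-level options of the glusterfs client:

    # Sketch: mount the thumb drive (assumed /dev/sda1) and point the
    # client's log file at it before running the failover test.
    mkdir -p /mnt/usb /mnt/glusterfs
    mount /dev/sda1 /mnt/usb
    glusterfs -f /etc/glusterfs/glusterfs-client.vol \
              -l /mnt/usb/glusterfs-client.log \
              -L DEBUG \
              /mnt/glusterfs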