----- Original Message -----
> From: "Anand Avati" <avati@xxxxxxxxxxx>
> To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> Cc: "Edward Shishkin" <edward@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Wednesday, May 21, 2014 10:53:54 AM
> Subject: Re: spurios failures in tests/encryption/crypt.t
>
> There are a few suspicious things going on here..
>
> On Tue, May 20, 2014 at 10:07 PM, Pranith Kumar Karampuri <
> pkarampu@xxxxxxxxxx> wrote:
>
> > > > hi,
> > > >     crypt.t is failing regression builds once in a while, and most
> > > > of the time it is because of failures just after the remount in the
> > > > script:
> > > >
> > > > TEST rm -f $M0/testfile-symlink
> > > > TEST rm -f $M0/testfile-link
> > > >
> > > > Both of these are failing with ENOTCONN. I got a chance to look at
> > > > the logs. According to the brick logs, this is what I see:
> > > > [2014-05-17 05:43:43.363979] E [posix.c:2272:posix_open]
> > > > 0-patchy-posix: open on /d/backends/patchy1/testfile-symlink:
> > > > Transport endpoint is not connected
>
> posix_open() happening on a symlink? This should NEVER happen. glusterfs
> itself should NEVER EVER be triggering symlink resolution on the server.
> In this case, for whatever reason, an open() is attempted on a symlink,
> and it is getting followed back onto gluster's own mount point (the test
> case creates an absolute link).
>
> So first find out: who is triggering fop->open() on a symlink. Fix the
> caller.
>
> Next: add a check in posix_open() to fail with ELOOP or EINVAL if the
> inode is a symlink.

I think I understood what you are saying. An open call for a symlink on the
fuse mount led to an open call again for the target on the same fuse mount,
which led to a deadlock :). Is that why we disallow opens on symlinks in
gluster?

Pranith

> > > > This is the very first time I saw posix failing with ENOTCONN. Do we
> > > > have these bricks on some other network mounts?
> > > > I wonder why it fails with ENOTCONN.
> > > >
> > > > I also see that it happens right after a call_bail on the mount.
> > > >
> > > > Pranith
> > >
> > > Hello.
> > > OK, I'll try to reproduce it.
> >
> > I tried re-creating the issue on my fedora VM and it happened just now.
> > When this issue happens I am not able to attach the process to gdb.
> > From /proc/, the threads have been in the following state for a while now:
> >
> > root@pranith-vm1 - /proc/4053/task
> > 10:20:50 :) ⚡ for i in `ls`; do cat $i/stack; echo "---------------------------------"; done
> > [<ffffffff811ed8ce>] ep_poll+0x21e/0x330
> > [<ffffffff811ee7b5>] SyS_epoll_wait+0xd5/0x100
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff8108cb6d>] hrtimer_nanosleep+0xad/0x170
> > [<ffffffff8108cc96>] SyS_nanosleep+0x66/0x80
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff81079271>] do_sigtimedwait+0x161/0x200
> > [<ffffffff81079386>] SYSC_rt_sigtimedwait+0x76/0xd0
> > [<ffffffff810793ee>] SyS_rt_sigtimedwait+0xe/0x10
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> > [<ffffffff810c32be>] futex_wait+0x17e/0x290
> > [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> > [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> > [<ffffffff810c32be>] futex_wait+0x17e/0x290
> > [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> > [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> > [<ffffffff810c32be>] futex_wait+0x17e/0x290
> > [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> > [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffffa0426229>] wait_answer_interruptible+0x89/0xd0 [fuse]
> >     <<----------- This is the important thing I think
> > [<ffffffffa0426612>] __fuse_request_send+0x232/0x290 [fuse]
> > [<ffffffffa0426682>] fuse_request_send+0x12/0x20 [fuse]
> > [<ffffffffa042ebea>] fuse_do_open+0xca/0x170 [fuse]
> > [<ffffffffa042ee06>] fuse_open_common+0x56/0x80 [fuse]
> > [<ffffffffa042ee40>] fuse_open+0x10/0x20 [fuse]
> > [<ffffffff811a6e4b>] do_dentry_open+0x1eb/0x280
> > [<ffffffff811a6f11>] finish_open+0x31/0x40
> > [<ffffffff811b77ba>] do_last+0x4ca/0xe00
> > [<ffffffff811b8510>] path_openat+0x420/0x690
> > [<ffffffff811b8e4a>] do_filp_open+0x3a/0x90
> > [<ffffffff811a82ee>] do_sys_open+0x12e/0x210
> > [<ffffffff811a83ee>] SyS_open+0x1e/0x20
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> > [<ffffffff810c32be>] futex_wait+0x17e/0x290
> > [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> > [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
> > [<ffffffff810c32be>] futex_wait+0x17e/0x290
> > [<ffffffff810c4e26>] do_futex+0xe6/0xc30
> > [<ffffffff810c59e1>] SyS_futex+0x71/0x150
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> > [<ffffffff8108cb6d>] hrtimer_nanosleep+0xad/0x170
> > [<ffffffff8108cc96>] SyS_nanosleep+0x66/0x80
> > [<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > ---------------------------------
> >
> > I don't know how to debug further, but it seems like the system call hung.
>
> The threads in the above process belong to glusterfsd, and glusterfsd is
> ending up making an open() attempt on a FUSE (its own) mount. Pretty
> obvious that it is deadlocking. Find the open()er on the symlink and you
> have your fix.
>
> Avati

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel