There are a few suspicious things going on here.
On Tue, May 20, 2014 at 10:07 PM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
> > hi,
> > crypt.t is failing regression builds once in a while, and most of
> > the time it is because of failures just after the remount in the
> > script.
> >
> > TEST rm -f $M0/testfile-symlink
> > TEST rm -f $M0/testfile-link
> >
> > Both of these are failing with ENOTCONN. I got a chance to look at
> > the logs. According to the brick logs, this is what I see:
> > [2014-05-17 05:43:43.363979] E [posix.c:2272:posix_open]
> > 0-patchy-posix: open on /d/backends/patchy1/testfile-symlink:
> > Transport endpoint is not connected
posix_open() happening on a symlink? This should NEVER happen. glusterfs itself should NEVER EVER be triggering symlink resolution on the server. In this case, for whatever reason, an open() is attempted on a symlink, and the link is getting followed back onto gluster's own mount point (the test case creates an absolute link).
So first find out: who is triggering fop->open() on a symlink. Fix the caller.
Next: add a check in posix_open() to fail with ELOOP or EINVAL if the inode is a symlink.
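For what it's worth, here is a minimal standalone sketch of the kind of guard being suggested (illustrative only, not the actual storage/posix patch; the helper name open_backend_entry is made up and the hard-coded path simply mirrors the brick log above). Opening the backend entry with O_NOFOLLOW makes the kernel itself return ELOOP when the entry is a symlink, which is the failure mode we want instead of the server following an absolute link back onto the FUSE mount.

/* Minimal standalone sketch (not the real posix_open() patch): refuse to
 * follow symlinks when opening a backend entry on the server side. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_backend_entry (const char *real_path, int flags)
{
        int fd = open (real_path, flags | O_NOFOLLOW);

        if (fd < 0 && errno == ELOOP)
                /* The backend entry is a symlink: do not follow it on the
                 * server; report the error back to the caller instead. */
                fprintf (stderr, "open on %s: %s\n", real_path,
                         strerror (errno));
        return fd;
}

int main (void)
{
        /* Path taken from the brick log above, purely for illustration. */
        int fd = open_backend_entry ("/d/backends/patchy1/testfile-symlink",
                                     O_RDONLY);
        if (fd >= 0)
                close (fd);
        return fd >= 0 ? 0 : 1;
}

In the real fix this check (or an equivalent IA_ISLNK-style type check on the inode) would of course live inside posix_open() itself rather than in a standalone program.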
I tried re-creating the issue on my Fedora VM and it happened just now. When this issue happens I am not able to attach gdb to the process. From /proc, the threads have been in the following state for a while now:
> > This is the very first time I saw posix failing with ENOTCONN. Do we
> > have these bricks on some other network mounts? I wonder why it fails
> > with ENOTCONN.
> >
> > I also see that it happens right after a call_bail on the mount.
> >
> > Pranith
>
> Hello.
> OK, I'll try to reproduce it.
root@pranith-vm1 - /proc/4053/task
10:20:50 :) ⚡ for i in `ls`; do cat $i/stack; echo "---------------------------------"; done
[<ffffffff811ed8ce>] ep_poll+0x21e/0x330
[<ffffffff811ee7b5>] SyS_epoll_wait+0xd5/0x100
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff8108cb6d>] hrtimer_nanosleep+0xad/0x170
[<ffffffff8108cc96>] SyS_nanosleep+0x66/0x80
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff81079271>] do_sigtimedwait+0x161/0x200
[<ffffffff81079386>] SYSC_rt_sigtimedwait+0x76/0xd0
[<ffffffff810793ee>] SyS_rt_sigtimedwait+0xe/0x10
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
[<ffffffff810c32be>] futex_wait+0x17e/0x290
[<ffffffff810c4e26>] do_futex+0xe6/0xc30
[<ffffffff810c59e1>] SyS_futex+0x71/0x150
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
[<ffffffff810c32be>] futex_wait+0x17e/0x290
[<ffffffff810c4e26>] do_futex+0xe6/0xc30
[<ffffffff810c59e1>] SyS_futex+0x71/0x150
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
[<ffffffff810c32be>] futex_wait+0x17e/0x290
[<ffffffff810c4e26>] do_futex+0xe6/0xc30
[<ffffffff810c59e1>] SyS_futex+0x71/0x150
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffffa0426229>] wait_answer_interruptible+0x89/0xd0 [fuse] <<----------- This is the important thing I think
[<ffffffffa0426612>] __fuse_request_send+0x232/0x290 [fuse]
[<ffffffffa0426682>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa042ebea>] fuse_do_open+0xca/0x170 [fuse]
[<ffffffffa042ee06>] fuse_open_common+0x56/0x80 [fuse]
[<ffffffffa042ee40>] fuse_open+0x10/0x20 [fuse]
[<ffffffff811a6e4b>] do_dentry_open+0x1eb/0x280
[<ffffffff811a6f11>] finish_open+0x31/0x40
[<ffffffff811b77ba>] do_last+0x4ca/0xe00
[<ffffffff811b8510>] path_openat+0x420/0x690
[<ffffffff811b8e4a>] do_filp_open+0x3a/0x90
[<ffffffff811a82ee>] do_sys_open+0x12e/0x210
[<ffffffff811a83ee>] SyS_open+0x1e/0x20
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
[<ffffffff810c32be>] futex_wait+0x17e/0x290
[<ffffffff810c4e26>] do_futex+0xe6/0xc30
[<ffffffff810c59e1>] SyS_futex+0x71/0x150
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff810c277a>] futex_wait_queue_me+0xda/0x140
[<ffffffff810c32be>] futex_wait+0x17e/0x290
[<ffffffff810c4e26>] do_futex+0xe6/0xc30
[<ffffffff810c59e1>] SyS_futex+0x71/0x150
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
[<ffffffff8108cb6d>] hrtimer_nanosleep+0xad/0x170
[<ffffffff8108cc96>] SyS_nanosleep+0x66/0x80
[<ffffffff816533d9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
---------------------------------
I don't know how to debug this further, but it seems like the open() system call is hung.
The threads in the above stack dump belong to glusterfsd, and glusterfsd is ending up making an open() attempt on a FUSE mount (its own), so the request loops back into gluster's own stack and can never complete. Pretty obvious that it is deadlocking. Find the open()er on the symlink and you have your fix.
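To confirm that scenario from outside the hung process, something along the lines of the sketch below could be used (illustrative only, not GlusterFS code; the backend symlink path is whatever brick entry you suspect). It reads the backend symlink's target without following it, then checks whether that target resolves onto a FUSE filesystem.

/* Standalone diagnostic sketch: read a backend symlink's target without
 * following it, then check whether that target lives on a FUSE
 * filesystem, i.e. whether a server-side open() would walk back onto
 * gluster's own mount. */
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/vfs.h>
#include <linux/magic.h>        /* FUSE_SUPER_MAGIC */

int main (int argc, char **argv)
{
        char          target[PATH_MAX];
        struct statfs sfs;
        ssize_t       n;

        if (argc < 2) {
                fprintf (stderr, "usage: %s <backend-symlink>\n", argv[0]);
                return 1;
        }

        /* readlink() does not follow the link, so this is safe even when
         * the target sits on a wedged FUSE mount. */
        n = readlink (argv[1], target, sizeof (target) - 1);
        if (n < 0) {
                perror ("readlink");
                return 1;
        }
        target[n] = '\0';
        printf ("%s -> %s\n", argv[1], target);

        /* statfs() does resolve the target, so run this from a separate
         * shell, never from within the brick process itself. */
        if (statfs (target, &sfs) == 0 && sfs.f_type == FUSE_SUPER_MAGIC)
                printf ("target resolves onto a FUSE filesystem\n");

        return 0;
}

If that last line prints for the backend entry, the absolute link created through the mount is being resolved on the brick, which matches the fuse_do_open stack above.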
Avati
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel