Re: Shall we revert quota-anon-fd.t?

Vijay Bellur <vbellur@xxxxxxxxxx> · Wed, 11 Jun 2014 13:31:04 +0530

On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:

On 06/11/2014 09:45 AM, Vijay Bellur wrote:
On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
hi,
    I see that quota-anon-fd.t is causing too many spurious failures. I
think we should revert it and raise a bug so that it can be fixed and
committed again along with the fix.

I think we can do that. The problem here stems from the fact that nfs
can deadlock when the client and servers are on the same node and
system memory utilization is on the higher side. We also need to look
into other nfs tests to determine if there are similar possibilities.

I doubt it is because of that, there are so many nfs mount tests,

I have been following this problem closely on b.g.o. This backtrace does 
indicate dd being hung:

INFO: task dd:6039 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dd            D ffff880028100840     0  6039   5704 0x00000080
 ffff8801f843faa8 0000000000000286 ffff8801ffffffff 01eff88bb6f58e28
 ffff8801db96bb80 ffff8801f8213590 00000000036c74dc ffffffffac6f4edf
 ffff8801faf11af8 ffff8801f843ffd8 000000000000fbc8 ffff8801faf11af8
Call Trace:
 [<ffffffff810a70b1>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8111f940>] ? sync_page+0x0/0x50
 [<ffffffff815280b3>] io_schedule+0x73/0xc0
 [<ffffffff8111f97d>] sync_page+0x3d/0x50
 [<ffffffff81528b7f>] __wait_on_bit+0x5f/0x90
 [<ffffffff8111fbb3>] wait_on_page_bit+0x73/0x80
 [<ffffffff8109b330>] ? wake_bit_function+0x0/0x50
 [<ffffffff81135c05>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111ffdb>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff811201a8>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811baa4e>] vfs_fsync_range+0x7e/0x100
 [<ffffffff811bab1b>] generic_write_sync+0x4b/0x50
 [<ffffffff81122056>] generic_file_aio_write+0xe6/0x100
 [<ffffffffa042f20e>] nfs_file_write+0xde/0x1f0 [nfs]
 [<ffffffff81188c8a>] do_sync_write+0xfa/0x140
 [<ffffffff8152a825>] ? page_fault+0x25/0x30
 [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8128ec6f>] ? __clear_user+0x3f/0x70
 [<ffffffff8128ec51>] ? __clear_user+0x21/0x70
 [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
 [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
 [<ffffffff81189881>] sys_write+0x51/0x90
 [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
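
When going through long console logs, the hung-task warning that heads a trace like the one above can be picked out mechanically. The following is only a minimal sketch: the function name `find_hung_tasks` and the `sample` text are illustrative assumptions, and the regular expression is modelled on the single warning line quoted in this mail.

```python
import re

# Modelled on the warning line quoted above:
# "INFO: task dd:6039 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (command, pid, seconds) for each hung-task warning found."""
    hits = []
    for line in log_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            hits.append(
                (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
            )
    return hits


# First two lines of the trace from this mail, used as sample input.
sample = """\
INFO: task dd:6039 blocked for more than 120 seconds.
      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
"""

print(find_hung_tasks(sample))
```

Feeding it the saved console output of a regression run would list every task the kernel reported as blocked, which makes it easier to see whether dd is the only victim.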

I have seen dd in uninterruptible sleep on b.g.o. There are also
instances [1] where anon-fd-nfs has run for more than 6000 seconds. This
definitely points to the nfs deadlock.
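
A quick way to confirm the uninterruptible sleep on a live machine is to read the state field from /proc/<pid>/stat, where "D" means uninterruptible sleep. This is a rough sketch assuming a Linux host; the helper names and the `example` line are made up for illustration and are not output from the failing machine.

```python
def parse_proc_stat_state(stat_line):
    """Extract (comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so the split is done on the last ')'.
    """
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return comm, state


def is_uninterruptible(pid):
    """Return True if the process is currently in state 'D' (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        _, state = parse_proc_stat_state(f.read())
    return state == "D"


# Made-up stat line for illustration; real lines have many more fields.
example = "6039 (dd) D 5704 6039 5704 0 -1 128 0 0 0 0"
print(parse_proc_stat_state(example))
```

Polling `is_uninterruptible(pid)` for the dd process while the test runs would show whether it stays in "D" for the whole duration or only briefly during writeback.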

only this one keeps failing for the past 2-3 days.

It is a function of the system memory consumption and what the oom
killer decides to kill. If NFS or a glusterfsd process gets killed, then
the test unit will fail. If the test can continue till the system
reclaims memory, it can possibly succeed.
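
To check this theory against a failed run, the kernel log can be scanned for oom killer victims and filtered for the processes the test depends on. The sketch below rests on assumptions: the "Killed process <pid> (<name>)" wording differs between kernel versions, the `CRITICAL_PREFIXES` tuple is a hypothetical choice, and the `sample` log lines are invented for illustration.

```python
import re

# Assumed message shape: "Killed process <pid> (<name>) ...".
# The surrounding text varies between kernel versions.
OOM_KILL_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)")

# Hypothetical choice: any process whose name starts with "glusterfs"
# (this covers both glusterfs and glusterfsd).
CRITICAL_PREFIXES = ("glusterfs",)


def oom_victims(log_text):
    """Return (command, pid) for every process the oom killer reported."""
    return [
        (m.group("comm"), int(m.group("pid")))
        for m in OOM_KILL_RE.finditer(log_text)
    ]


def critical_victims(log_text, prefixes=CRITICAL_PREFIXES):
    """Return only the victims whose loss would likely fail the test."""
    return [
        (comm, pid)
        for comm, pid in oom_victims(log_text)
        if comm.startswith(prefixes)
    ]


# Invented log lines for illustration, not from a real run.
sample = """\
Out of memory: Kill process 4321 (glusterfsd) score 512 or sacrifice child
Killed process 4321 (glusterfsd) total-vm:1024kB, anon-rss:512kB, file-rss:0kB
Killed process 777 (python) total-vm:2048kB, anon-rss:100kB, file-rss:0kB
"""

print(critical_victims(sample))
```

An empty result for a failed run would suggest the failure was not caused by the oom killer taking out a gluster process, which is one of the other possibilities that would then need root causing.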

However, there could be other possibilities and we need to root cause 
them as well.

-Vijay

[1] http://build.gluster.org/job/regression/4783/console
