On Wed, Jun 11, 2014 at 12:58:46PM +0200, Niels de Vos wrote:
> On Wed, Jun 11, 2014 at 01:31:04PM +0530, Vijay Bellur wrote:
> > On 06/11/2014 10:45 AM, Pranith Kumar Karampuri wrote:
> > >
> > > On 06/11/2014 09:45 AM, Vijay Bellur wrote:
> > >> On 06/11/2014 08:21 AM, Pranith Kumar Karampuri wrote:
> > >>> hi,
> > >>>      I see that quota-anon-fd.t is causing too many spurious
> > >>> failures. I think we should revert it and raise a bug so that it
> > >>> can be fixed and committed again along with the fix.
> > >>
> > >> I think we can do that. The problem here stems from the fact that
> > >> NFS can deadlock when we have client and server on the same node
> > >> and system memory utilization is on the higher side. We also need
> > >> to look into other nfs tests to determine if there are similar
> > >> possibilities.
> > >
> > > I doubt it is because of that; there are so many nfs mount tests,
> >
> > I have been following this problem closely on b.g.o. This backtrace
> > does indicate that dd is hung:
> >
> > INFO: task dd:6039 blocked for more than 120 seconds.
> >       Not tainted 2.6.32-431.3.1.el6.x86_64 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > dd            D ffff880028100840     0  6039   5704 0x00000080
> >  ffff8801f843faa8 0000000000000286 ffff8801ffffffff 01eff88bb6f58e28
> >  ffff8801db96bb80 ffff8801f8213590 00000000036c74dc ffffffffac6f4edf
> >  ffff8801faf11af8 ffff8801f843ffd8 000000000000fbc8 ffff8801faf11af8
> > Call Trace:
> >  [<ffffffff810a70b1>] ? ktime_get_ts+0xb1/0xf0
> >  [<ffffffff8111f940>] ? sync_page+0x0/0x50
> >  [<ffffffff815280b3>] io_schedule+0x73/0xc0
> >  [<ffffffff8111f97d>] sync_page+0x3d/0x50
> >  [<ffffffff81528b7f>] __wait_on_bit+0x5f/0x90
> >  [<ffffffff8111fbb3>] wait_on_page_bit+0x73/0x80
> >  [<ffffffff8109b330>] ? wake_bit_function+0x0/0x50
> >  [<ffffffff81135c05>] ? pagevec_lookup_tag+0x25/0x40
> >  [<ffffffff8111ffdb>] wait_on_page_writeback_range+0xfb/0x190
> >  [<ffffffff811201a8>] filemap_write_and_wait_range+0x78/0x90
> >  [<ffffffff811baa4e>] vfs_fsync_range+0x7e/0x100
> >  [<ffffffff811bab1b>] generic_write_sync+0x4b/0x50
> >  [<ffffffff81122056>] generic_file_aio_write+0xe6/0x100
> >  [<ffffffffa042f20e>] nfs_file_write+0xde/0x1f0 [nfs]
> >  [<ffffffff81188c8a>] do_sync_write+0xfa/0x140
> >  [<ffffffff8152a825>] ? page_fault+0x25/0x30
> >  [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
> >  [<ffffffff8128ec6f>] ? __clear_user+0x3f/0x70
> >  [<ffffffff8128ec51>] ? __clear_user+0x21/0x70
> >  [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
> >  [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
> >  [<ffffffff81189881>] sys_write+0x51/0x90
> >  [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
> >  [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> >
> > I have seen dd being in uninterruptible sleep on b.g.o. There are
> > also instances [1] where anon-fd-nfs has run for 6000+ seconds. This
> > definitely points to the nfs deadlock.
>
> [1] is a run where nfs.drc is still enabled. I'd like to know if you
> have seen other, more recent runs where http://review.gluster.org/8004
> has been included (disable nfs.drc by default).

To answer my own question, yes, some runs have that included:
- http://build.gluster.org/job/regression/4828/console

Should Bug 1107937 "quota-anon-fd-nfs.t fails spuriously" be used to
figure out what the problem is and diagnose the issues there?

Niels

> Are there backtraces at the same time where alloc_pages() and/or
> try_to_free_pages() are listed?
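(If someone with access to the regression slave can check this: a rough
way to capture those backtraces would be something like the sketch
below. This is untested on b.g.o and assumes the machine permits the
'w' sysrq trigger.)

  # Dump the stacks of all tasks in uninterruptible sleep into the kernel log
  echo w > /proc/sysrq-trigger

  # List processes currently stuck in D-state, with the kernel function they wait in
  ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

  # Check whether the memory-reclaim path shows up in the captured traces
  dmesg | egrep 'hung_task|alloc_pages|try_to_free_pages'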
> The blocking of the writer (here: dd) likely depends on the needed
> memory allocations on the receiving end (here: the nfs-server). This
> is a relatively common issue for the Linux kernel NFS server when
> loopback mounts are used under memory pressure.
>
> A nice description and proposed solution for this has recently been
> posted to LWN.net:
> - http://lwn.net/Articles/595652/
>
> This solution is client-side (the NFS client in the Linux kernel), and
> from a cursory look through it, it should help prevent these issues
> for Gluster-nfs too. But I don't think the patches have been merged
> yet.
>
> > > only this one has kept failing for the past 2-3 days.
> >
> > It is a function of the system memory consumption and what the OOM
> > killer decides to kill. If NFS or a glusterfsd process gets killed,
> > then the test unit will fail. If the test can continue until the
> > system reclaims memory, it can possibly succeed.
> >
> > However, there could be other possibilities and we need to root
> > cause them as well.
>
> Yes, I agree. It would help if there were a known way to trigger the
> OOM so that investigation can be done on a different system than
> build.gluster.org. Does anyone know of steps that reliably reproduce
> this kind of issue?
>
> Thanks,
> Niels
>
> > -Vijay
> >
> > [1] http://build.gluster.org/job/regression/4783/console
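PS: I do not have reliable steps either, but as a starting point for
trying to trigger this on a scratch VM (not on build.gluster.org),
something along these lines might work. This is only a sketch: the
volume name, brick path, mount point and sizes are made up, and it
assumes the Gluster NFS server is enabled on the volume.

  # Create and start a throw-away volume, then loopback-mount it over NFSv3.
  mkdir -p /bricks/patchy /mnt/nfs
  gluster volume create patchy $(hostname):/bricks/patchy force
  gluster volume start patchy
  mount -t nfs -o vers=3,nolock $(hostname):/patchy /mnt/nfs

  # Create memory pressure: fill tmpfs with roughly 40% of RAM
  # (the default /dev/shm limit is 50% of RAM), ...
  dd if=/dev/zero of=/dev/shm/balloon bs=1M \
     count=$(( $(free -m | awk '/^Mem:/ {print $2}') * 4 / 10 ))

  # ... then push a large synced write through the NFS mount and watch
  # dmesg for "blocked for more than 120 seconds" messages.
  dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=2048 conv=fsync

If dd does get stuck, the sysrq sketch above should show whether the
server side is sitting in the memory-reclaim path.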