On Wed, Apr 13, 2011 at 03:33:42PM -0500, Alex Elder wrote:
> On Thu, 2011-04-07 at 16:19 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > When the inode cache shrinker runs, we may have lots of dirty inodes
> > queued up in the VFS dirty queues that have not been expired. The
> > typical case for this with XFS is atime updates. The result is that a
> > highly concurrent workload that copies files and then later reads them
> > (say to verify checksums) dirties all the inodes again, even when
> > relatime is used.
> >
> > In a constrained memory environment, this results in a large number of
> > dirty inodes using all of available memory and memory reclaim being
> > unable to free them as dirty inodes are considered active. This problem
> > was uncovered by Chris Mason during recent low memory stress testing.
> >
> > The fix is to trigger VFS level writeback from the XFS inode cache
> > shrinker if there isn't already writeback in progress. This ensures
> > that when we enter a low memory situation we start cleaning inodes (via
> > the flusher thread) on the filesystem immediately, thereby making it
> > more likely that we will be able to evict those dirty inodes from the
> > VFS in the near future.
> >
> > The mechanism is not perfect - it only acts on the current filesystem,
> > so if all the dirty inodes are on a different filesystem it won't help.
> > However, it seems to be a valid assumption that the filesystem with
> > lots of dirty inodes is going to have the shrinker called very soon
> > after the memory shortage begins, so this shouldn't be an issue.
> >
> > The other flaw is that there is no guarantee that the flusher thread
> > will make progress fast enough to clean the dirty inodes so they can be
> > reclaimed in the near future. However, this mechanism does improve the
> > resilience of the filesystem under the test conditions - instead of
> > reliably triggering the OOM killer 20 minutes into the stress test, it
> > took more than 6 hours before it happened.
> >
> > This small addition definitely improves the low memory resilience of
> > XFS on this type of workload, and best of all it has no impact on
> > performance when memory is not constrained.
> >
> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Looks good to me.
>
> Reviewed-by: Alex Elder <aelder@xxxxxxx>

Unfortunately, we simply can't take the s_umount lock in reclaim context.
So further hackery is going to be required here - I think that
writeback_inodes_sb_nr_if_idle() needs to use trylocks. If the s_umount
lock is taken in write mode, then it's pretty certain that the sb is
busy.... (rough sketches of the shrinker hook and of this trylock idea are
appended after the trace below.)

[ 2226.939859] =================================
[ 2226.940026] [ INFO: inconsistent lock state ]
[ 2226.940026] 2.6.39-rc3-dgc+ #1162
[ 2226.940026] ---------------------------------
[ 2226.940026] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[ 2226.940026] diff/23704 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 2226.940026] (&type->s_umount_key#23){+++++?}, at: [<ffffffff81191bf0>] writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026] {RECLAIM_FS-ON-W} state was registered at:
[ 2226.940026]   [<ffffffff810c06a7>] mark_held_locks+0x67/0x90
[ 2226.940026]   [<ffffffff810c0796>] lockdep_trace_alloc+0xc6/0x100
[ 2226.940026]   [<ffffffff8115fec9>] kmem_cache_alloc+0x39/0x1e0
[ 2226.940026]   [<ffffffff814afce7>] kmem_zone_alloc+0x77/0xf0
[ 2226.940026]   [<ffffffff814afd7e>] kmem_zone_zalloc+0x1e/0x50
[ 2226.940026]   [<ffffffff814a5f51>] _xfs_trans_alloc+0x31/0x80
[ 2226.940026]   [<ffffffff814a1b74>] xfs_log_sbcount+0x84/0xf0
[ 2226.940026]   [<ffffffff814a26be>] xfs_unmountfs+0xde/0x1a0
[ 2226.940026]   [<ffffffff814bd466>] xfs_fs_put_super+0x46/0x80
[ 2226.940026]   [<ffffffff8116cb92>] generic_shutdown_super+0x72/0x100
[ 2226.940026]   [<ffffffff8116cc51>] kill_block_super+0x31/0x80
[ 2226.940026]   [<ffffffff8116d415>] deactivate_locked_super+0x45/0x60
[ 2226.940026]   [<ffffffff8116e10a>] deactivate_super+0x4a/0x70
[ 2226.940026]   [<ffffffff8118951c>] mntput_no_expire+0xec/0x140
[ 2226.940026]   [<ffffffff81189a08>] sys_umount+0x78/0x3c0
[ 2226.940026]   [<ffffffff81b76c82>] system_call_fastpath+0x16/0x1b
[ 2226.940026] irq event stamp: 2767751
[ 2226.940026] hardirqs last enabled at (2767751): [<ffffffff810ee0b6>] __call_rcu+0xa6/0x190
[ 2226.940026] hardirqs last disabled at (2767750): [<ffffffff810ee05a>] __call_rcu+0x4a/0x190
[ 2226.940026] softirqs last enabled at (2758484): [<ffffffff8108d1a3>] __do_softirq+0x143/0x220
[ 2226.940026] softirqs last disabled at (2758471): [<ffffffff81b77e9c>] call_softirq+0x1c/0x30
[ 2226.940026]
[ 2226.940026] other info that might help us debug this:
[ 2226.940026] 3 locks held by diff/23704:
[ 2226.940026]  #0: (xfs_iolock_active){++++++}, at: [<ffffffff81487408>] xfs_ilock+0x138/0x190
[ 2226.940026]  #1: (&mm->mmap_sem){++++++}, at: [<ffffffff81b7258b>] do_page_fault+0xeb/0x4f0
[ 2226.940026]  #2: (shrinker_rwsem){++++..}, at: [<ffffffff8112cb6d>] shrink_slab+0x3d/0x1a0
[ 2226.940026]
[ 2226.940026] stack backtrace:
[ 2226.940026] Pid: 23704, comm: diff Not tainted 2.6.39-rc3-dgc+ #1162
[ 2226.940026] Call Trace:
[ 2226.940026]  [<ffffffff810bf5fa>] print_usage_bug+0x18a/0x190
[ 2226.940026]  [<ffffffff8104982f>] ? save_stack_trace+0x2f/0x50
[ 2226.940026]  [<ffffffff810bf770>] ? print_irq_inversion_bug+0x170/0x170
[ 2226.940026]  [<ffffffff810c055e>] mark_lock+0x35e/0x440
[ 2226.940026]  [<ffffffff810c1227>] __lock_acquire+0x447/0x14b0
[ 2226.940026]  [<ffffffff81065ed8>] ? pvclock_clocksource_read+0x58/0xd0
[ 2226.940026]  [<ffffffff814a84c8>] ? xfs_ail_push_all+0x78/0x80
[ 2226.940026]  [<ffffffff810650b9>] ? kvm_clock_read+0x19/0x20
[ 2226.940026]  [<ffffffff81042bc9>] ? sched_clock+0x9/0x10
[ 2226.940026]  [<ffffffff810aff15>] ? sched_clock_local+0x25/0x90
[ 2226.940026]  [<ffffffff810c2344>] lock_acquire+0xb4/0x140
[ 2226.940026]  [<ffffffff81191bf0>] ? writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026]  [<ffffffff81b76a16>] ? ftrace_call+0x5/0x2b
[ 2226.940026]  [<ffffffff81b6d731>] down_read+0x51/0xa0
[ 2226.940026]  [<ffffffff81191bf0>] ? writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026]  [<ffffffff81191bf0>] writeback_inodes_sb_nr_if_idle+0x50/0x80
[ 2226.940026]  [<ffffffff814bec18>] ? xfs_syncd_queue_reclaim+0x28/0xc0
[ 2226.940026]  [<ffffffff814c02e9>] xfs_reclaim_inode_shrink+0x99/0xc0
[ 2226.940026]  [<ffffffff8112cc67>] shrink_slab+0x137/0x1a0
[ 2226.940026]  [<ffffffff8112e40c>] do_try_to_free_pages+0x20c/0x440
[ 2226.940026]  [<ffffffff8112e7a2>] try_to_free_pages+0x92/0x130
[ 2226.940026]  [<ffffffff81124826>] __alloc_pages_nodemask+0x496/0x930
[ 2226.940026]  [<ffffffff810aff15>] ? sched_clock_local+0x25/0x90
[ 2226.940026]  [<ffffffff81b76a16>] ? ftrace_call+0x5/0x2b
[ 2226.940026]  [<ffffffff8115c169>] alloc_pages_vma+0x99/0x150
[ 2226.940026]  [<ffffffff811681b3>] do_huge_pmd_anonymous_page+0x143/0x380
[ 2226.940026]  [<ffffffff81b76a16>] ? ftrace_call+0x5/0x2b
[ 2226.940026]  [<ffffffff81141b26>] handle_mm_fault+0x136/0x290
[ 2226.940026]  [<ffffffff81b72601>] do_page_fault+0x161/0x4f0
[ 2226.940026]  [<ffffffff810b0038>] ? sched_clock_cpu+0xb8/0x110
[ 2226.940026]  [<ffffffff810c1116>] ? __lock_acquire+0x336/0x14b0
[ 2226.940026]  [<ffffffff811275b8>] ? __do_page_cache_readahead+0x208/0x2b0
[ 2226.940026]  [<ffffffff81065ed8>] ? pvclock_clocksource_read+0x58/0xd0
[ 2226.940026]  [<ffffffff816d527d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2226.940026]  [<ffffffff81b6f265>] page_fault+0x25/0x30
[ 2226.940026]  [<ffffffff8111bed4>] ? file_read_actor+0x114/0x1d0
[ 2226.940026]  [<ffffffff8111bde1>] ? file_read_actor+0x21/0x1d0
[ 2226.940026]  [<ffffffff8111dfad>] generic_file_aio_read+0x35d/0x7b0
[ 2226.940026]  [<ffffffff814b769e>] xfs_file_aio_read+0x15e/0x2e0
[ 2226.940026]  [<ffffffff8116a4d0>] ? do_sync_write+0x120/0x120
[ 2226.940026]  [<ffffffff8116a5aa>] do_sync_read+0xda/0x120
[ 2226.940026]  [<ffffffff8169aeee>] ? security_file_permission+0x8e/0x90
[ 2226.940026]  [<ffffffff8116acdd>] vfs_read+0xcd/0x180
[ 2226.940026]  [<ffffffff8116ae94>] sys_read+0x54/0xa0
[ 2226.940026]  [<ffffffff81b76c82>] system_call_fastpath+0x16/0x1b

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs
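
A rough sketch of the shape of the shrinker hook the quoted patch describes,
reconstructed from the commit message and from the call chain in the trace
(xfs_reclaim_inode_shrink -> writeback_inodes_sb_nr_if_idle). The
xfs_reclaimable_inode_count() helper and some of the surrounding details are
illustrative only, not the literal patch:

/* fs/xfs/linux-2.6/xfs_sync.c, 2.6.39-era shrinker API - simplified sketch */
static int
xfs_reclaim_inode_shrink(
	struct shrinker		*shrink,
	int			nr_to_scan,
	gfp_t			gfp_mask)
{
	struct xfs_mount	*mp = container_of(shrink,
					struct xfs_mount, m_inode_shrink);

	if (nr_to_scan) {
		/* the shrinker may be called from contexts that can't do IO */
		if (!(gfp_mask & __GFP_FS))
			return -1;

		/* kick background inode reclaim and push the AIL */
		xfs_syncd_queue_reclaim(mp);
		xfs_ail_push_all(mp->m_ail);

		/*
		 * The addition described above: start VFS-level writeback on
		 * this sb if none is in progress, so dirty but unexpired
		 * inodes (e.g. atime updates) get cleaned and become
		 * reclaimable.  This is the call that takes s_umount in
		 * reclaim context and trips lockdep in the trace above.
		 */
		writeback_inodes_sb_nr_if_idle(mp->m_super, nr_to_scan);

		/* ... then scan and reclaim clean inodes as before ... */
		xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
	}

	/* illustrative helper: reclaimable inodes left on this mount */
	return xfs_reclaimable_inode_count(mp);
}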
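
And a minimal sketch of the trylock variant suggested above, assuming the
2.6.39-era form of writeback_inodes_sb_nr_if_idle() in fs/fs-writeback.c:
if s_umount can't be taken for read, the sb is almost certainly being
mounted, unmounted or frozen, so skip the writeback attempt rather than
blocking in reclaim context.

/* fs/fs-writeback.c - sketch only, not a tested patch */
int writeback_inodes_sb_nr_if_idle(struct super_block *sb, unsigned long nr)
{
	if (writeback_in_progress(sb->s_bdi))
		return 0;

	/*
	 * Trylock so this is safe to call from reclaim context: blocking
	 * on s_umount here is what lockdep flagged in the trace above,
	 * and a write-held s_umount means the sb is busy anyway.
	 */
	if (!down_read_trylock(&sb->s_umount))
		return 0;

	writeback_inodes_sb_nr(sb, nr);
	up_read(&sb->s_umount);
	return 1;
}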