Re: [PATCH 5/5] xfs: kick inode writeback when low on memory

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 2 Mar 2011 14:06:02 +1100

On Wed, Feb 23, 2011 at 09:16:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> When the inode cache shrinker runs, we may have lots of dirty inodes queued up
> in the VFS dirty queues that have not been expired. The typical case for this
> with XFS is atime updates. The result is that a highly concurrent workload that
> copies files and then later reads them (say to verify checksums) dirties all
> the inodes again, even when relatime is used.
> 
> In a constrained memory environment, this results in a large number of dirty
> inodes using all of available memory and memory reclaim being unable to free
> them as dirty inodes are considered active. This problem was uncovered by Chris
> Mason during recent low memory stress testing.
> 
> The fix is to trigger VFS level writeback from the XFS inode cache shrinker if
> there isn't already writeback in progress. This ensures that when we enter a
> low memory situation we start cleaning inodes (via the flusher thread) on the
> filesystem immediately, thereby making it more likely that we will be able to
> evict those dirty inodes from the VFS in the near future.
> 
> The mechanism is not perfect - it only acts on the current filesystem, so if
> all the dirty inodes are on a different filesystem it won't help. However, it
> seems to be a valid assumption that the filesystem with lots of dirty inodes
> is going to have the shrinker called very soon after the memory shortage
> begins, so this shouldn't be an issue.
> 
> The other flaw is that there is no guarantee that the flusher thread will make
> progress fast enough to clean the dirty inodes so they can be reclaimed in the
> near future. However, this mechanism does improve the resilience of the
> filesystem under the test conditions - instead of reliably triggering the OOM
> killer 20 minutes into the stress test, it took more than 6 hours before it
> happened.
> 
> This small addition definitely improves the low memory resilience of XFS on
> this type of workload, and best of all it has no impact on performance when
> memory is not constrained.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c |   11 +++++++++++
>  1 files changed, 11 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index 35138dc..3abde91 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -1044,6 +1044,17 @@ xfs_reclaim_inode_shrink(
>  		if (!(gfp_mask & __GFP_FS))
>  			return -1;
>  
> +		/*
> +		 * make sure VFS is cleaning inodes so they can be pruned
> +		 * and marked for reclaim in the XFS inode cache. If we don't
> +		 * do this the VFS can accumulate dirty inodes and we can OOM
> +		 * before they are cleaned by the periodic VFS writeback.
> +		 *
> +		 * This takes VFS level locks, so we can only do this after
> +		 * the __GFP_FS checks otherwise lockdep gets really unhappy.
> +		 */
> +		writeback_inodes_sb_nr_if_idle(mp->m_super, nr_to_scan);
> +

Well, this generates a deadlock if we get a low memory situation
before the bdi flusher thread for the underlying device has been
created. That is, we get low memory, kick
writeback_inodes_sb_nr_if_idle(), we end up with the bdi-default
thread trying to create the flush-x:y thread, which gets stuck
waiting for kthread_create() to complete.

kthread_create() never completes because the do_fork() call in
kthreadd fails its memory allocation and again calls (via the
shrinker) writeback_inodes_sb_nr_if_idle(), which thinks that
writeback_in_progress(bdi) is false, so it tries to start
writeback again....

So, writeback_inodes_sb_nr_if_idle() is busted w.r.t. only queuing a
single writeback instance as writeback is only marked as in progress
once the queued callback is running. Perhaps writeback_in_progress()
should return true if the BDI_pending bit is set, indicating the
flusher thread is being created right now, but I'm not sure that is
sufficient to avoid all the potential races here.
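
To make the difference between the two checks concrete, here is a
minimal userspace sketch. It is not kernel code: struct model_bdi, the
MODEL_* bits and every function name below are hypothetical stand-ins,
and it assumes (as a simplification) that the pending bit is set at the
moment work is queued and that the flusher thread never gets to run,
which is the situation described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two state bits discussed in the mail. */
enum {
	MODEL_PENDING = 1u << 0,	/* flusher thread is being created */
	MODEL_RUNNING = 1u << 1,	/* queued callback is actually running */
};

struct model_bdi {
	unsigned int state;
	int queued;			/* writeback requests queued so far */
};

/* Models the current check: only a running callback counts. */
static bool in_progress_running_only(const struct model_bdi *bdi)
{
	return (bdi->state & MODEL_RUNNING) != 0;
}

/* Models the suggested check: a pending thread creation counts too. */
static bool in_progress_running_or_pending(const struct model_bdi *bdi)
{
	return (bdi->state & (MODEL_RUNNING | MODEL_PENDING)) != 0;
}

typedef bool (*in_progress_fn)(const struct model_bdi *bdi);

/* Models the "start writeback only if idle" call made by the shrinker. */
static void kick_if_idle(struct model_bdi *bdi, in_progress_fn in_progress)
{
	assert(bdi != NULL);
	if (in_progress(bdi))
		return;
	bdi->queued++;
	/* Assumption: thread creation starts here and never completes. */
	bdi->state |= MODEL_PENDING;
}

/* Call the shrinker path repeatedly while the thread is still pending. */
static int count_queued(in_progress_fn in_progress, int shrinker_calls)
{
	struct model_bdi bdi = { 0, 0 };
	int i;

	for (i = 0; i < shrinker_calls; i++)
		kick_if_idle(&bdi, in_progress);
	return bdi.queued;
}
```

In this model, count_queued(in_progress_running_only, 5) queues a
request on every call, while count_queued(in_progress_running_or_pending, 5)
queues only the first one. It says nothing about the other races
mentioned above, only about the repeated queuing.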

I'm open to ideas here - I could convert the bdi flusher
infrastructure to cmwqs rather than using worker threads, or move
all dirty inode tracking and writeback into XFS, or ???

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx