On Tue, Apr 26, 2011 at 08:17:51PM +0800, Jan Kara wrote:
> On Sun 24-04-11 11:15:31, Wu Fengguang wrote:
> > > One of the many requirements for writeback is that if userspace is
> > > continually dirtying pages in a particular file, that shouldn't cause
> > > the kupdate function to concentrate on that file's newly-dirtied pages,
> > > neglecting pages from other files which were less-recently dirtied.
> > > (and dirty inodes, etc).
> >
> > Sadly I do find old pages that the flusher never gets a chance to catch
> > and write out.
>
> What kind of load do you use?

Sorry, I was just thinking about it and came up with a _theoretic_ case.

> > In the case below, if the task dirties pages fast enough at the end of
> > the file, writeback_index will never get a chance to wrap back. There
> > may be various variations of this case.
> >
> >     file head
> >     [ ***           ==>                        ***************]==>
> >       old pages     writeback_index            fresh dirties
> >
> > Ironically the current kernel relies on pageout() to catch these old
> > pages, which is not only inefficient, but also unreliable. If a full
> > LRU walk takes an hour, the old pages may stay dirty for an hour.
>
> Well, the kupdate behavior has always been just a best-effort thing. We
> always tried to handle the common cases well but didn't try to solve all
> of them. Unless we want to track the dirty age of every page (which we
> don't want because it's too expensive), there is really no way to make
> syncing of old pages work 100% for all cases, unless we do data-integrity
> type writeback for the whole inode - but that could create new problems
> with stalling other files for too long, I suspect.

Yeah, it's a hard problem in general. The flusher naturally works in a
coarse way..

> > We may have to do (conditional) tagged ->writepages to safeguard users
> > from losing data they'd expect to have been written hours ago.
>
> Well, if the file is continuously written (and in your case it must even
> be continuously grown), I'd be content if we handle the common case of
> linear append well (that happens for log files etc.). If we can do well
> for more cases, even better, but I'd be cautious not to disrupt other,
> more common cases.

I scratched a patch (totally untested) which should prevent any kind of
starvation inside an inode. Will this be too heavyweight?

Thanks,
Fengguang
---
Subject: writeback: livelock prevention inside actively dirtied files
Date: Tue Apr 26 21:35:47 CST 2011

- refresh dirtied_when on every full writeback_index cycle
  (pages may be skipped on SYNC_NONE, but as long as they are retried
  in the next cycle..)

- do tagged sync when writeback_index has not cycled for too long
  (the arbitrary 60s threshold may lead to more page tagging overhead
  on a "large dirty threshold but slow storage" system..)

Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
---
 fs/fs-writeback.c       |    1 +
 include/linux/fs.h      |    1 +
 include/linux/pagemap.h |   16 ++++++++++++++++
 mm/page-writeback.c     |   24 ++++++++++++++++++------
 4 files changed, 36 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-26 21:26:39.000000000 +0800
@@ -1110,6 +1110,7 @@ void __mark_inode_dirty(struct inode *in
 			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
+			inode->i_mapping->writeback_cycle_time = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);

--- linux-next.orig/include/linux/fs.h	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-04-26 21:26:39.000000000 +0800
@@ -639,6 +639,7 @@ struct address_space {
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
+	unsigned long		writeback_cycle_time;
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */

--- linux-next.orig/mm/page-writeback.c	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-26 21:33:47.000000000 +0800
@@ -835,6 +835,9 @@ void tag_pages_for_writeback(struct addr
 		cond_resched();
 		/* We check 'start' to handle wrapping when end == ~0UL */
 	} while (tagged >= WRITEBACK_TAG_BATCH && start);
+
+	mapping_set_tagged_sync(mapping);
+	mapping->writeback_cycle_time = jiffies;
 }
 EXPORT_SYMBOL(tag_pages_for_writeback);

@@ -872,7 +875,7 @@ int write_cache_pages(struct address_spa
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
 	int range_whole = 0;
-	int tag;
+	int tag = PAGECACHE_TAG_DIRTY;

 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -884,13 +887,19 @@ int write_cache_pages(struct address_spa
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
-		tag = PAGECACHE_TAG_TOWRITE;
-	else
-		tag = PAGECACHE_TAG_DIRTY;
+	if (!index)
+		mapping->writeback_cycle_time = jiffies;

-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL ||
+	    (!mapping_tagged_sync(mapping) &&
+	     jiffies - mapping->host->dirtied_when > 60 * HZ)) {
 		tag_pages_for_writeback(mapping, index, end);
+		tag = PAGECACHE_TAG_TOWRITE;
+	}
+
+	if (mapping_tagged_sync(mapping))
+		tag = PAGECACHE_TAG_TOWRITE;
+
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
@@ -899,6 +908,9 @@ int write_cache_pages(struct address_spa
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0) {
 			done_index = 0;
+			mapping->host->dirtied_when = mapping->writeback_cycle_time;
+			if (tag == PAGECACHE_TAG_TOWRITE)
+				mapping_clear_tagged_sync(mapping);
 			break;
 		}

--- linux-next.orig/include/linux/pagemap.h	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/include/linux/pagemap.h	2011-04-26 21:46:38.000000000 +0800
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_TAGGED_SYNC	= __GFP_BITS_SHIFT + 4,	/* sync only tagged pages */
 };

 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -53,6 +54,21 @@ static inline int mapping_unevictable(st
 	return !!mapping;
 }

+static inline void mapping_set_tagged_sync(struct address_space *mapping)
+{
+	set_bit(AS_TAGGED_SYNC, &mapping->flags);
+}
+
+static inline void mapping_clear_tagged_sync(struct address_space *mapping)
+{
+	clear_bit(AS_TAGGED_SYNC, &mapping->flags);
+}
+
+static inline int mapping_tagged_sync(struct address_space *mapping)
+{
+	return test_bit(AS_TAGGED_SYNC, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
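
For illustration, here is a minimal userspace sketch of the kind of load
described above: a task that keeps dirtying pages at the tail of a growing
file, so that range_cyclic writeback keeps finding fresh dirty pages near
EOF and writeback_index rarely wraps back to the old pages at the head.
This is a hypothetical example (the case above is purely theoretic); the
file name and sizes are arbitrary.

/*
 * Hypothetical load generator (illustration only, not a reproducer used
 * in this thread): dirty a batch of pages at the head of a file once,
 * then keep appending and re-dirtying the tail as fast as possible.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	off_t size = 0;
	int fd = open("tail-dirtier.dat", O_CREAT | O_TRUNC | O_RDWR, 0644);

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));

	/* these become the "old pages" near the file head */
	for (int i = 0; i < 256; i++)
		size += pwrite(fd, buf, sizeof(buf), size);

	/* keep growing the file and re-dirtying its last page */
	for (;;) {
		size += pwrite(fd, buf, sizeof(buf), size);
		pwrite(fd, buf, sizeof(buf), size - sizeof(buf));
	}
}

Under such a load the old dirty pages at the file head would only be
cleaned by pageout() during LRU reclaim, which is the situation the patch
tries to avoid by falling back to tagged writeback once writeback_index
has not cycled for 60 seconds.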