On Wed, 07 Oct 2009 15:38:36 +0800 Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:

> As proposed by Chris, Dave and Jan, let balance_dirty_pages() wait for
> the per-bdi flusher to writeback enough pages for it, instead of
> starting foreground writeback by itself. By doing so we harvest two
> benefits:
>
> - avoid concurrent writeback of multiple inodes (Dave Chinner)
>
>   If every thread doing writes and being throttled starts foreground
>   writeback, it leads to N IO submitters from at least N different
>   inodes at the same time, ending up with N different sets of IO being
>   issued with potentially zero locality to each other, resulting in
>   much lower elevator sort/merge efficiency, and hence we seek the disk
>   all over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
>
> - avoid one constraint towards huge per-file nr_to_write
>
>   The write_chunk used by balance_dirty_pages() should be small enough
>   to prevent user-noticeable one-shot latency, i.e. each sleep/wait
>   inside balance_dirty_pages() shall be small enough. When it starts
>   its own writeback, it must specify a small nr_to_write. The throttle
>   wait queue removes this dependency, by the way.
>

May I ask a question? (Maybe not directly related to this patch itself,
sorry.)

Recent work such as "writeback: switch to per-bdi threads for flushing
data" removed congestion_wait() from balance_dirty_pages() and added
schedule_timeout_interruptible(). And this one replaces it with
wake_up + wait_queue.

IIUC, the "iowait" cpustat data is calculated from runqueue->nr_iowait,
as in:

== kernel/sched.c
void account_idle_time(cputime_t cputime)
{
        struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
        cputime64_t cputime64 = cputime_to_cputime64(cputime);
        struct rq *rq = this_rq();

        if (atomic_read(&rq->nr_iowait) > 0)
                cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
        else
                cpustat->idle = cputime64_add(cpustat->idle, cputime64);
}
==

Then, for "cpu is in iowait" to be reported, runqueue->nr_iowait has to
be modified somewhere. In older kernels, congestion_wait() et al. did
that by calling io_schedule_timeout(). How is runqueue->nr_iowait
handled now?

Thanks,
-Kame

> CC: Chris Mason <chris.mason@xxxxxxxxxx>
> CC: Dave Chinner <david@xxxxxxxxxxxxx>
> CC: Jan Kara <jack@xxxxxxx>
> CC: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> CC: Jens Axboe <jens.axboe@xxxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  fs/fs-writeback.c           |   71 ++++++++++++++++++++++++++++++++++
>  include/linux/backing-dev.h |   15 +++++++
>  mm/backing-dev.c            |    4 +
>  mm/page-writeback.c         |   53 ++++++-------------------
>  4 files changed, 103 insertions(+), 40 deletions(-)
>
> --- linux.orig/mm/page-writeback.c	2009-10-06 23:38:30.000000000 +0800
> +++ linux/mm/page-writeback.c	2009-10-06 23:38:43.000000000 +0800
> @@ -218,6 +218,15 @@ static inline void __bdi_writeout_inc(st
>  {
>  	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
>  			      bdi->max_prop_frac);
> +
> +	/*
> +	 * The DIRTY_THROTTLE_PAGES_STOP test is an optional optimization, so
> +	 * it's OK to be racy. We set DIRTY_THROTTLE_PAGES_STOP*2 in other
> +	 * places to reduce the race possibility.
> +	 */
> +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
> +	    atomic_dec_and_test(&bdi->throttle_pages))
> +		bdi_writeback_wakeup(bdi);
>  }
>
>  void bdi_writeout_inc(struct backing_dev_info *bdi)
> @@ -458,20 +467,10 @@ static void balance_dirty_pages(struct a
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
>  	int dirty_exceeded;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>
>  	for (;;) {
> -		struct writeback_control wbc = {
> -			.bdi		= bdi,
> -			.sync_mode	= WB_SYNC_NONE,
> -			.older_than_this = NULL,
> -			.nr_to_write	= write_chunk,
> -			.range_cyclic	= 1,
> -		};
> -
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  				 global_page_state(NR_UNSTABLE_NFS);
>  		nr_writeback = global_page_state(NR_WRITEBACK) +
> @@ -518,39 +517,13 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		if (bdi_nr_reclaimable > bdi_thresh) {
> -			writeback_inodes_wbc(&wbc);
> -			pages_written += write_chunk - wbc.nr_to_write;
> -			/* don't wait if we've done enough */
> -			if (pages_written >= write_chunk)
> -				break;
> -		}
> -		schedule_timeout_interruptible(pause);
> -
> -		/*
> -		 * Increase the delay for each loop, up to our previous
> -		 * default of taking a 100ms nap.
> -		 */
> -		pause <<= 1;
> -		if (pause > HZ / 10)
> -			pause = HZ / 10;
> +		bdi_writeback_wait(bdi, write_chunk);
> +		break;
>  	}
>
>  	if (!dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>
> -	if (writeback_in_progress(bdi))
> -		return;
> -
>  	/*
>  	 * In laptop mode, we wait until hitting the higher threshold before
>  	 * starting background writeout, and then write out all the way down
> @@ -559,8 +532,8 @@ static void balance_dirty_pages(struct a
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> -	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> +	if (!laptop_mode && (nr_reclaimable > background_thresh) &&
> +	    can_submit_background_writeback(bdi))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>
> --- linux.orig/include/linux/backing-dev.h	2009-10-06 23:38:43.000000000 +0800
> +++ linux/include/linux/backing-dev.h	2009-10-06 23:38:43.000000000 +0800
> @@ -86,6 +86,13 @@ struct backing_dev_info {
>
>  	struct list_head work_list;
>
> +	/*
> +	 * dirtier process throttling
> +	 */
> +	spinlock_t	 throttle_lock;
> +	struct list_head throttle_list;	 /* nr to sync for each task */
> +	atomic_t	 throttle_pages; /* nr to sync for head task */
> +
>  	struct device *dev;
>
>  #ifdef CONFIG_DEBUG_FS
> @@ -99,6 +106,12 @@ struct backing_dev_info {
>   */
>  #define WB_FLAG_BACKGROUND_WORK	30
> +/*
> + * when no task is throttled, set throttle_pages to larger than this,
> + * to avoid unnecessary atomic decreases.
> + */
> +#define DIRTY_THROTTLE_PAGES_STOP	(1 << 22)
> +
>  int bdi_init(struct backing_dev_info *bdi);
>  void bdi_destroy(struct backing_dev_info *bdi);
>
> @@ -110,6 +123,8 @@ void bdi_start_writeback(struct backing_
>  			 long nr_pages);
>  int bdi_writeback_task(struct bdi_writeback *wb);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> +int bdi_writeback_wakeup(struct backing_dev_info *bdi);
> +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
>
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> --- linux.orig/fs/fs-writeback.c	2009-10-06 23:38:43.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-10-06 23:38:43.000000000 +0800
> @@ -25,6 +25,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/backing-dev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/completion.h>
>  #include "internal.h"
>
>  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
> @@ -265,6 +266,72 @@ void bdi_start_writeback(struct backing_
>  	bdi_alloc_queue_work(bdi, &args);
>  }
>
> +struct dirty_throttle_task {
> +	long			nr_pages;
> +	struct list_head	list;
> +	struct completion	complete;
> +};
> +
> +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> +{
> +	struct dirty_throttle_task tt = {
> +		.nr_pages = nr_pages,
> +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> +	};
> +	unsigned long flags;
> +
> +	/*
> +	 * register throttle pages
> +	 */
> +	spin_lock_irqsave(&bdi->throttle_lock, flags);
> +	if (list_empty(&bdi->throttle_list))
> +		atomic_set(&bdi->throttle_pages, nr_pages);
> +	list_add(&tt.list, &bdi->throttle_list);
> +	spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> +
> +	/*
> +	 * make sure we will be woke up by someone
> +	 */
> +	if (can_submit_background_writeback(bdi))
> +		bdi_start_writeback(bdi, NULL, 0);
> +
> +	wait_for_completion(&tt.complete);
> +}
> +
> +/*
> + * return 1 if there are more waiting tasks.
> + */
> +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> +{
> +	struct dirty_throttle_task *tt;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&bdi->throttle_lock, flags);
> +	/*
> +	 * remove and wakeup head task
> +	 */
> +	if (!list_empty(&bdi->throttle_list)) {
> +		tt = list_entry(bdi->throttle_list.prev,
> +				struct dirty_throttle_task, list);
> +		list_del(&tt->list);
> +		complete(&tt->complete);
> +	}
> +	/*
> +	 * update throttle pages
> +	 */
> +	if (!list_empty(&bdi->throttle_list)) {
> +		tt = list_entry(bdi->throttle_list.prev,
> +				struct dirty_throttle_task, list);
> +		atomic_set(&bdi->throttle_pages, tt->nr_pages);
> +	} else {
> +		tt = NULL;
> +		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> +	}
> +	spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> +
> +	return tt != NULL;
> +}
> +
>  /*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
> @@ -760,6 +827,10 @@ static long wb_writeback(struct bdi_writ
>  		spin_unlock(&inode_lock);
>  	}
>
> +	if (args->for_background)
> +		while (bdi_writeback_wakeup(wb->bdi))
> +			;	/* unthrottle all tasks */
> +
>  	return wrote;
>  }
>
> --- linux.orig/mm/backing-dev.c	2009-10-06 23:37:47.000000000 +0800
> +++ linux/mm/backing-dev.c	2009-10-06 23:38:43.000000000 +0800
> @@ -646,6 +646,10 @@ int bdi_init(struct backing_dev_info *bd
>  	bdi->wb_mask = 1;
>  	bdi->wb_cnt = 1;
>
> +	spin_lock_init(&bdi->throttle_lock);
> +	INIT_LIST_HEAD(&bdi->throttle_list);
> +	atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> +
>  	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
>  		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
>  		if (err)
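
For reference on the iowait question above: in the old path the
accounting came from the fact that congestion_wait() sleeps via
io_schedule_timeout(), which bumps rq->nr_iowait around the sleep.
Roughly like this (a simplified sketch of the ~2.6.31-era code,
paraphrased rather than quoted exactly):

== mm/backing-dev.c (simplified)
long congestion_wait(int sync, long timeout)
{
        long ret;
        DEFINE_WAIT(wait);
        wait_queue_head_t *wqh = &congestion_wqh[sync];

        prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
        ret = io_schedule_timeout(timeout);     /* iowait accounted here */
        finish_wait(wqh, &wait);
        return ret;
}
==

== kernel/sched.c (simplified)
long __sched io_schedule_timeout(long timeout)
{
        struct rq *rq = raw_rq();
        long ret;

        delayacct_blkio_start();
        atomic_inc(&rq->nr_iowait);     /* seen by account_idle_time() */
        ret = schedule_timeout(timeout);
        atomic_dec(&rq->nr_iowait);
        delayacct_blkio_end();
        return ret;
}
==

A plain schedule_timeout_interruptible() or wait_for_completion() does
not touch rq->nr_iowait, so time spent throttled in
balance_dirty_pages() would no longer show up as iowait unless the wait
is wrapped in an io_schedule()-style helper.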