Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[ CC FS and MM lists ]

Patch looks good to me, however we need to be careful because it's
introducing a new interface. So it's desirable to get some acks from
the FS/MM developers.

Thanks,
Fengguang

On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> From: Namjae Jeon <namjae.jeon@xxxxxxxxxxx>
> 
> This patch is based on suggestion by Wu Fengguang:
> https://lkml.org/lkml/2011/8/19/19
> 
> kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> ratio. It also maintains per task dirty rate limit to keep balance of
> dirty pages at any given instance by doing bdi bandwidth estimation.
> 
> Kernel also has max_ratio/min_ratio tunables to specify percentage of
> writecache to control per bdi dirty limits and task throttling.
> 
> However, there might be a usecase where user wants a per bdi writeback tuning
> parameter to flush dirty data once per bdi dirty data reach a threshold
> especially at NFS server.
> 
> dirty_background_centisecs provides an interface where user can tune
> background writeback start threshold using
> /sys/block/sda/bdi/dirty_background_centisecs
> 
> dirty_background_centisecs is used alongwith average bdi write bandwidth
> estimation to start background writeback.
> 
> One of the use case to demonstrate the patch functionality can be
> on NFS setup:-
> We have a NFS setup with ethernet line of 100Mbps, while the USB
> disk is attached to server, which has a local speed of 25MBps. Server
> and client both are arm target boards.
> 
> Now if we perform a write operation over NFS (client to server), as
> per the network speed, data can travel at max speed of 100Mbps. But
> if we check the default write speed of USB hdd over NFS it comes
> around to 8MB/sec, far below the speed of network.
> 
> Reason being is as per the NFS logic, during write operation, initially
> pages are dirtied on NFS client side, then after reaching the dirty
> threshold/writeback limit (or in case of sync) data is actually sent
> to NFS server (so now again pages are dirtied on server side). This
> will be done in COMMIT call from client to server i.e if 100MB of data
> is dirtied and sent then it will take minimum 100MB/10Mbps ~ 8-9 seconds.
> 
> After the data is received, now it will take approx 100/25 ~4 Seconds to
> write the data to USB Hdd on server side. Hence making the overall time
> to write this much of data ~12 seconds, which in practically comes out to
> be near 7 to 8MB/second. After this a COMMIT response will be sent to NFS
> client.
> 
> However we may improve this write performace by making the use of NFS
> server idle time i.e while data is being received from the client,
> simultaneously initiate the writeback thread on server side. So instead
> of waiting for the complete data to come and then start the writeback,
> we can work in parallel while the network is still busy in receiving the
> data. Hence in this way overall performace will be improved.
> 
> If we tune dirty_background_centisecs, we can see there
> is increase in the performace and it comes out to be ~ 11MB/seconds.
> Results are:-
> 
> Write test(create a 1 GB file) result at 'NFS client' after changing 
> /sys/block/sda/bdi/dirty_background_centisecs 
> on  *** NFS Server only - not on NFS Client ****
> 
> ---------------------------------------------------------------------
> |WRITE Test with various 'dirty_background_centisecs' at NFS Server |
> ---------------------------------------------------------------------
> |          | default = 0 | 300 centisec| 200 centisec| 100 centisec |
> ---------------------------------------------------------------------
> |RecSize   |  WriteSpeed |  WriteSpeed |  WriteSpeed |  WriteSpeed  |
> ---------------------------------------------------------------------
> |10485760  |  8.44MB/sec |  8.60MB/sec |  9.30MB/sec |  10.27MB/sec |
> | 1048576  |  8.48MB/sec |  8.87MB/sec |  9.31MB/sec |  10.34MB/sec |
> |  524288  |  8.37MB/sec |  8.42MB/sec |  9.84MB/sec |  10.47MB/sec |
> |  262144  |  8.16MB/sec |  8.51MB/sec |  9.52MB/sec |  10.62MB/sec |
> |  131072  |  8.48MB/sec |  8.81MB/sec |  9.42MB/sec |  10.55MB/sec |
> |   65536  |  8.38MB/sec |  9.09MB/sec |  9.76MB/sec |  10.53MB/sec |
> |   32768  |  8.65MB/sec |  9.00MB/sec |  9.57MB/sec |  10.54MB/sec |
> |   16384  |  8.27MB/sec |  8.80MB/sec |  9.39MB/sec |  10.43MB/sec |
> |    8192  |  8.52MB/sec |  8.70MB/sec |  9.40MB/sec |  10.50MB/sec |
> |    4096  |  8.20MB/sec |  8.63MB/sec |  9.80MB/sec |  10.35MB/sec |
> ---------------------------------------------------------------------
> 
> we can see, average write speed is increased to ~10-11MB/sec.
> ============================================================
> 
> This patch provides the changes per block devices. So that we may modify the
> dirty_background_centisecs as per the device and overall system is not impacted
> by the changes and we get improved perforamace in certain use cases.
> 
> NOTE: dirty_background_centisecs is used alongwith average bdi write bandwidth
> estimation to start background writeback. But, bdi write bandwidth estimation
> is an _estimation_ and may become wildly wrong. dirty_background_centisecs
> tuning may not always work to the user expectations. dirty_background_centisecs
> will require careful tuning by users on NFS Server.
> As a good use case, dirty_background_time should be set around 100 (1 sec).
> It should not be set to very small value, otherwise it will start
> flushing for small I/O size dirty data.
> 
> Changes since v1:
> * make default value of 'dirty_background_centisecs = 0' sothat there is no change
>   in default writeback behaviour.
> * Add description of dirty_background_centisecs in documentation.
> 
> Original-patch-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> Signed-off-by: Namjae Jeon <namjae.jeon@xxxxxxxxxxx>
> Signed-off-by: Vivek Trivedi <t.vivek@xxxxxxxxxxx>
> ---
>  fs/fs-writeback.c           |   21 +++++++++++++++++++--
>  include/linux/backing-dev.h |    1 +
>  include/linux/writeback.h   |    1 +
>  mm/backing-dev.c            |   21 +++++++++++++++++++++
>  mm/page-writeback.c         |    3 ++-
>  5 files changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index fd255c0..c427130 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -769,6 +769,22 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
>  	return false;
>  }
>  
> +bool over_dirty_bground_time(struct backing_dev_info *bdi)
> +{
> +	unsigned long background_thresh;
> +
> +	if (!bdi->dirty_background_centisecs)
> +		return false;
> +
> +	background_thresh = bdi->avg_write_bandwidth *
> +				bdi->dirty_background_centisecs / 100;
> +
> +	if (bdi_stat(bdi, BDI_RECLAIMABLE) > background_thresh)
> +		return true;
> +
> +	return false;
> +}
> +
>  /*
>   * Called under wb->list_lock. If there are multiple wb per bdi,
>   * only the flusher working on the first wb should do it.
> @@ -828,7 +844,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * For background writeout, stop when we are below the
>  		 * background dirty threshold
>  		 */
> -		if (work->for_background && !over_bground_thresh(wb->bdi))
> +		if (work->for_background && !over_bground_thresh(wb->bdi) &&
> +			!over_dirty_bground_time(wb->bdi))
>  			break;
>  
>  		/*
> @@ -920,7 +937,7 @@ static unsigned long get_nr_dirty_pages(void)
>  
>  static long wb_check_background_flush(struct bdi_writeback *wb)
>  {
> -	if (over_bground_thresh(wb->bdi)) {
> +	if (over_bground_thresh(wb->bdi) || over_dirty_bground_time(wb->bdi)) {
>  
>  		struct wb_writeback_work work = {
>  			.nr_pages	= LONG_MAX,
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..43d2e15 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -95,6 +95,7 @@ struct backing_dev_info {
>  
>  	unsigned int min_ratio;
>  	unsigned int max_ratio, max_prop_frac;
> +	unsigned int dirty_background_centisecs;
>  
>  	struct bdi_writeback wb;  /* default writeback info for this bdi */
>  	spinlock_t wb_lock;	  /* protects work_list */
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 50c3e8f..6dc2abe 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -96,6 +96,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
>  void inode_wait_for_writeback(struct inode *inode);
> +bool over_dirty_bground_time(struct backing_dev_info *bdi);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d3ca2b3..b1b2fd2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -221,12 +221,32 @@ static ssize_t max_ratio_store(struct device *dev,
>  }
>  BDI_SHOW(max_ratio, bdi->max_ratio)
>  
> +static ssize_t dirty_background_centisecs_store(struct device *dev,
> +		struct device_attribute *attr, const char *buf, size_t count)
> +{
> +	struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +	unsigned int centisecs;
> +	ssize_t ret;
> +
> +	ret = kstrtouint(buf, 10, &centisecs);
> +	if (ret < 0)
> +		return ret;
> +
> +	bdi->dirty_background_centisecs = centisecs;
> +	if (over_dirty_bground_time(bdi))
> +		bdi_start_background_writeback(bdi);
> +
> +	return count;
> +}
> +BDI_SHOW(dirty_background_centisecs, bdi->dirty_background_centisecs)
> +
>  #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>  
>  static struct device_attribute bdi_dev_attrs[] = {
>  	__ATTR_RW(read_ahead_kb),
>  	__ATTR_RW(min_ratio),
>  	__ATTR_RW(max_ratio),
> +	__ATTR_RW(dirty_background_centisecs),
>  	__ATTR_NULL,
>  };
>  
> @@ -628,6 +648,7 @@ int bdi_init(struct backing_dev_info *bdi)
>  	bdi->min_ratio = 0;
>  	bdi->max_ratio = 100;
>  	bdi->max_prop_frac = FPROP_FRAC_BASE;
> +	bdi->dirty_background_centisecs = 0;
>  	spin_lock_init(&bdi->wb_lock);
>  	INIT_LIST_HEAD(&bdi->bdi_list);
>  	INIT_LIST_HEAD(&bdi->work_list);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5ad5ce2..8c1530d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1403,7 +1403,8 @@ pause:
>  	if (laptop_mode)
>  		return;
>  
> -	if (nr_reclaimable > background_thresh)
> +	if (nr_reclaimable > background_thresh ||
> +			over_dirty_bground_time(bdi))
>  		bdi_start_background_writeback(bdi);
>  }
>  
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>


[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]