Re: [PATCH 15/19] bcache: fix issue of writeback rate at minimum 1 key per second

Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> · Fri, 27 Oct 2017 19:09:25 +0000 (UTC)

[+cc Michael Lyle]

On Fri, 27 Oct 2017, Eric Wheeler wrote:

> On Sun, 16 Jul 2017, Coly Li wrote:
> 
> > On 2017/7/1 上午4:43, bcache@xxxxxxxxxxxxxxxxxx wrote:
> > > From: Tang Junhui <tang.junhui@xxxxxxxxxx>
> > > 
> > > When there is not enough dirty data in writeback cache,
> > > writeback rate is at minimum 1 key per second
> > > util all dirty data to be cleaned, it is inefficiency,
> > > and also causes waste of energy;
> > 
> > Hi Junhui and Eric,
> > 
> > What: /sys/block/<disk>/bcache/writeback_percent
> > Description:
> >       For backing devices: If nonzero, writeback from cache to
> >       backing device only takes place when more than this percentage
> >       of the cache is used, allowing more write coalescing to take
> >       place and reducing total number of writes sent to the backing
> >       device. Integer between 0 and 40.
> > 
> > I see above text from Documentation/ABI/testing/sysfs-block-bcache (I
> > know this document is quite old), it seems if "not enough" means dirty
> > data percentage is less then writback_percent, bcache should not
> > performance writeback I/O. But in __update_writeback_rate(),
> > writeback_rate.rate is clamped in [1, NSEC_PER_MSEC]. It seems in PD
> > controller code of __update_writeback_rate(), writeback_percent is only
> > used to calculate dirty target number, its another functionality as
> > writeback threshold is not handled here.
> > 
> > > 
> > > in this patch, When there is not enough dirty data,
> > > let the writeback rate to be 0, and writeback re-schedule
> > > in bch_writeback_thread() periodically with schedule_timeout(),
> > > the behaviors are as follows :
> > > 
> > > 1) If no dirty data have been read into dc->writeback_keys,
> > > goto step 2), otherwise keep writing these dirty data to
> > > back-end device at 1 key per second, until all these dirty data
> > > write over, then goto step 2).
> > > 
> > > 2) Loop in bch_writeback_thread() to check if there is enough
> > > dirty data for writeback. if there is not enough diry data for
> > > writing, then sleep 10 seconds, otherwise, write dirty data to
> > > back-end device.
> > 
> > Bcache uses a Proportion-Differentiation Controller to control writeback
> > rate. When dirty data is far from target, writeback rate is higher; when
> > dirty data is close to target, writeback rate is slower. The advantage
> > of PD controller here is, when regular I/O and writeback I/O happens in
> > same time,
> > - When there are a lot of dirty data, writeback I/O can have more chance
> > to write them back to cached device, which in turns has positive impact
> > to regular I/O.
> > - When dirty data is decreased and close to target dirty number, less
> > writeback I/O can help regular I/O has better throughput and latency.
> > 
> > The root cause of 1 key per second is, the PD controller is designed for
> > better I/O performance, not less energy consumption. When the existing
> > dirty data gets closed to target dirty number, the PD controller chooses
> > to use longer writeback time to make a better regular I/O performance.
> > If it is designed for less energy consumption, it should keep the
> > writeback rate in a high level and finish writing back all dirty data as
> > soon as possible.
> > 
> > This patch may introduce an unexpected behavior of dirty data writeback
> > throughput, when regular write I/O and writeback I/O happen in same
> > time. In this case, dirty data number may shake up and down around
> > target dirty number, it is possible that change (the variable in
> > __update_writeback_rate()) is a minus value, and the result of
> > dc->writeback_rate.rate + change happens to be 0. This patch changes the
> > clamp range of writeback_rate.rate to [0, NSEC_PER_MSEC], so
> > writeback_rate.rate can be possible to be 0. And in bch_next_delay() if
> > d->rate is zero, the write back I/O will be delayed to now +
> > NSEC_PER_SEC. When there is no regular I/O it works well, but when there
> > is regular I/O, this longer delay may cause more dirty data piled in
> > cache device, and PD controller cannot generage a stable writeback rate.
> > This is not an expected behavior for the writeback rate PD controller.
> > 
> > Another method to fix might be,
> > 1) define a sysfs to define writeback_rate with max/dynamic option.
> > 2) dynamic writeback_rate as default
> > 3) when max is set, in __update_writeback_rate() assign NSEC_PER_MSEC to
> > writeback_rate.rate
> > 4) in bch_writeback_thread(), if no writeback I/O on fly, and dirty data
> > does not reach dc->writeback_percent, schedule out.
> > 5) if writeback is necessary then do it in max rate and finish it as
> > soon as possible, to save laptop energy.
> > 
> > The above method might be helpful to energy save as well (perform dirty
> > dat write back in batch), and does not change default PD controller
> > behavior.
> > 
> > Just for your reference. Or if you are too busy to look at it, I can try
> > to compose a patch for review.
> 
> Hi Coli,
> 
> Did this go anywere?  I think the 1-key/sec fix is a good idea and your 
> suggestion will help out mobile users.
> 
> 
> 
> --
> Eric Wheeler
> 
> 
> > 
> > Coly
> > 
> > > 
> > > Signed-off-by: Tang Junhui <tang.junhui@xxxxxxxxxx>
> > > ---
> > >  drivers/md/bcache/util.c      |  9 ++++++++-
> > >  drivers/md/bcache/writeback.c | 11 +++++++----
> > >  2 files changed, 15 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> > > index 8c3a938..49dcf09 100644
> > > --- a/drivers/md/bcache/util.c
> > > +++ b/drivers/md/bcache/util.c
> > > @@ -210,7 +210,14 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done)
> > >  {
> > >  	uint64_t now = local_clock();
> > >  
> > > -	d->next += div_u64(done * NSEC_PER_SEC, d->rate);
> > > +	/*
> > > +	  if d->rate is zero, write the left dirty data
> > > +	  at the speed of one key per second
> > > +	*/
> > > +	if(!d->rate)
> > > +		d->next = now + NSEC_PER_SEC;
> > > +	else
> > > +		d->next += div_u64(done * NSEC_PER_SEC, d->rate);
> > >  
> > >  	if (time_before64(now + NSEC_PER_SEC, d->next))
> > >  		d->next = now + NSEC_PER_SEC;
> > > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> > > index 25289e4..4104eaa 100644
> > > --- a/drivers/md/bcache/writeback.c
> > > +++ b/drivers/md/bcache/writeback.c
> > > @@ -16,6 +16,8 @@
> > >  #include <linux/sched/clock.h>
> > >  #include <trace/events/bcache.h>
> > >  
> > > +#define WRITE_BACK_WAIT_CYCLE		10 * HZ
> > > +
> > >  /* Rate limiting */
> > >  
> > >  static void __update_writeback_rate(struct cached_dev *dc)
> > > @@ -55,13 +57,14 @@ static void __update_writeback_rate(struct cached_dev *dc)
> > >  
> > >  	/* Don't increase writeback rate if the device isn't keeping up */
> > >  	if (change > 0 &&
> > > +	    dc->writeback_rate.rate >0 &&
> > >  	    time_after64(local_clock(),
> > >  			 dc->writeback_rate.next + NSEC_PER_MSEC))
> > >  		change = 0;
> > >  
> > >  	dc->writeback_rate.rate =
> > >  		clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change,
> > > -			1, NSEC_PER_MSEC);
> > > +			0, NSEC_PER_MSEC);
> > >  
> > >  	dc->writeback_rate_proportional = proportional;
> > >  	dc->writeback_rate_derivative = derivative;
> > > @@ -420,15 +423,15 @@ static int bch_writeback_thread(void *arg)
> > >  	while (!kthread_should_stop()) {
> > >  		down_write(&dc->writeback_lock);
> > >  		if (!atomic_read(&dc->has_dirty) ||
> > > -		    (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) &&
> > > -		     !dc->writeback_running)) {
> > > +		    ((!dc->writeback_rate.rate || !dc->writeback_running) &&
> > > +		      !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))) {
> > >  			up_write(&dc->writeback_lock);
> > >  			set_current_state(TASK_INTERRUPTIBLE);
> > >  
> > >  			if (kthread_should_stop())
> > >  				return 0;
> > >  
> > > -			schedule();
> > > +			schedule_timeout(WRITE_BACK_WAIT_CYCLE);
> > >  			continue;
> > >  		}
> > >  
> > > 
> > 
> > 
> > -- 
> > Coly Li
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >