On 2017/7/1 上午4:43, bcache@xxxxxxxxxxxxxxxxxx wrote: > From: Tang Junhui <tang.junhui@xxxxxxxxxx> > > When there is not enough dirty data in writeback cache, > writeback rate is at minimum 1 key per second > util all dirty data to be cleaned, it is inefficiency, > and also causes waste of energy; Hi Junhui and Eric, What: /sys/block/<disk>/bcache/writeback_percent Description: For backing devices: If nonzero, writeback from cache to backing device only takes place when more than this percentage of the cache is used, allowing more write coalescing to take place and reducing total number of writes sent to the backing device. Integer between 0 and 40. I see above text from Documentation/ABI/testing/sysfs-block-bcache (I know this document is quite old), it seems if "not enough" means dirty data percentage is less then writback_percent, bcache should not performance writeback I/O. But in __update_writeback_rate(), writeback_rate.rate is clamped in [1, NSEC_PER_MSEC]. It seems in PD controller code of __update_writeback_rate(), writeback_percent is only used to calculate dirty target number, its another functionality as writeback threshold is not handled here. > > in this patch, When there is not enough dirty data, > let the writeback rate to be 0, and writeback re-schedule > in bch_writeback_thread() periodically with schedule_timeout(), > the behaviors are as follows : > > 1) If no dirty data have been read into dc->writeback_keys, > goto step 2), otherwise keep writing these dirty data to > back-end device at 1 key per second, until all these dirty data > write over, then goto step 2). > > 2) Loop in bch_writeback_thread() to check if there is enough > dirty data for writeback. if there is not enough diry data for > writing, then sleep 10 seconds, otherwise, write dirty data to > back-end device. Bcache uses a Proportion-Differentiation Controller to control writeback rate. When dirty data is far from target, writeback rate is higher; when dirty data is close to target, writeback rate is slower. The advantage of PD controller here is, when regular I/O and writeback I/O happens in same time, - When there are a lot of dirty data, writeback I/O can have more chance to write them back to cached device, which in turns has positive impact to regular I/O. - When dirty data is decreased and close to target dirty number, less writeback I/O can help regular I/O has better throughput and latency. The root cause of 1 key per second is, the PD controller is designed for better I/O performance, not less energy consumption. When the existing dirty data gets closed to target dirty number, the PD controller chooses to use longer writeback time to make a better regular I/O performance. If it is designed for less energy consumption, it should keep the writeback rate in a high level and finish writing back all dirty data as soon as possible. This patch may introduce an unexpected behavior of dirty data writeback throughput, when regular write I/O and writeback I/O happen in same time. In this case, dirty data number may shake up and down around target dirty number, it is possible that change (the variable in __update_writeback_rate()) is a minus value, and the result of dc->writeback_rate.rate + change happens to be 0. This patch changes the clamp range of writeback_rate.rate to [0, NSEC_PER_MSEC], so writeback_rate.rate can be possible to be 0. And in bch_next_delay() if d->rate is zero, the write back I/O will be delayed to now + NSEC_PER_SEC. When there is no regular I/O it works well, but when there is regular I/O, this longer delay may cause more dirty data piled in cache device, and PD controller cannot generage a stable writeback rate. This is not an expected behavior for the writeback rate PD controller. Another method to fix might be, 1) define a sysfs to define writeback_rate with max/dynamic option. 2) dynamic writeback_rate as default 3) when max is set, in __update_writeback_rate() assign NSEC_PER_MSEC to writeback_rate.rate 4) in bch_writeback_thread(), if no writeback I/O on fly, and dirty data does not reach dc->writeback_percent, schedule out. 5) if writeback is necessary then do it in max rate and finish it as soon as possible, to save laptop energy. The above method might be helpful to energy save as well (perform dirty dat write back in batch), and does not change default PD controller behavior. Just for your reference. Or if you are too busy to look at it, I can try to compose a patch for review. Coly > > Signed-off-by: Tang Junhui <tang.junhui@xxxxxxxxxx> > --- > drivers/md/bcache/util.c | 9 ++++++++- > drivers/md/bcache/writeback.c | 11 +++++++---- > 2 files changed, 15 insertions(+), 5 deletions(-) > > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c > index 8c3a938..49dcf09 100644 > --- a/drivers/md/bcache/util.c > +++ b/drivers/md/bcache/util.c > @@ -210,7 +210,14 @@ uint64_t bch_next_delay(struct bch_ratelimit *d, uint64_t done) > { > uint64_t now = local_clock(); > > - d->next += div_u64(done * NSEC_PER_SEC, d->rate); > + /* > + if d->rate is zero, write the left dirty data > + at the speed of one key per second > + */ > + if(!d->rate) > + d->next = now + NSEC_PER_SEC; > + else > + d->next += div_u64(done * NSEC_PER_SEC, d->rate); > > if (time_before64(now + NSEC_PER_SEC, d->next)) > d->next = now + NSEC_PER_SEC; > diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c > index 25289e4..4104eaa 100644 > --- a/drivers/md/bcache/writeback.c > +++ b/drivers/md/bcache/writeback.c > @@ -16,6 +16,8 @@ > #include <linux/sched/clock.h> > #include <trace/events/bcache.h> > > +#define WRITE_BACK_WAIT_CYCLE 10 * HZ > + > /* Rate limiting */ > > static void __update_writeback_rate(struct cached_dev *dc) > @@ -55,13 +57,14 @@ static void __update_writeback_rate(struct cached_dev *dc) > > /* Don't increase writeback rate if the device isn't keeping up */ > if (change > 0 && > + dc->writeback_rate.rate >0 && > time_after64(local_clock(), > dc->writeback_rate.next + NSEC_PER_MSEC)) > change = 0; > > dc->writeback_rate.rate = > clamp_t(int64_t, (int64_t) dc->writeback_rate.rate + change, > - 1, NSEC_PER_MSEC); > + 0, NSEC_PER_MSEC); > > dc->writeback_rate_proportional = proportional; > dc->writeback_rate_derivative = derivative; > @@ -420,15 +423,15 @@ static int bch_writeback_thread(void *arg) > while (!kthread_should_stop()) { > down_write(&dc->writeback_lock); > if (!atomic_read(&dc->has_dirty) || > - (!test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) && > - !dc->writeback_running)) { > + ((!dc->writeback_rate.rate || !dc->writeback_running) && > + !test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags))) { > up_write(&dc->writeback_lock); > set_current_state(TASK_INTERRUPTIBLE); > > if (kthread_should_stop()) > return 0; > > - schedule(); > + schedule_timeout(WRITE_BACK_WAIT_CYCLE); > continue; > } > > -- Coly Li -- To unsubscribe from this list: send the line "unsubscribe linux-bcache" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html