From: Tang Junhui <tang.junhui@xxxxxxxxxx> Hello Mike: For the second question, I thinks this modification is somewhat complex, cannot we do something simple to resolve it? I remember there were some patches trying to avoid too small writeback rate, Coly, is there any progress now? ------- Tang Junhui > Ah-- re #1 -- I was investigating earlier why not as much was combined > as I thought should be when idle. This is surely a factor. Thanks > for the catch-- KEY_OFFSET is correct. I will fix and retest. > > (Under heavy load, the correct thing still happens, but not under > light or intermediate load0. > > About #2-- I wanted to attain a bounded amount of "combining" of > operations. If we have 5 4k extents in a row to dispatch, it seems > really wasteful to issue them as 5 IOs 60ms apart, which the existing > code would be willing to do-- I'd rather do a 20k write IO (basically > the same cost as a 4k write IO) and then sleep 300ms. It is dependent > on the elevator/IO scheduler merging the requests. At the same time, > I'd rather not combine a really large request. > > It would be really neat to blk_plug the backing device during the > write issuance, but that requires further work. > > Thanks > > Mike > > On Tue, Sep 26, 2017 at 11:51 PM, <tang.junhui@xxxxxxxxxx> wrote: > > From: Tang Junhui <tang.junhui@xxxxxxxxxx> > > > > Hello Lyle: > > > > Two questions: > > 1) In keys_contiguous(), you judge I/O contiguous in cache device, but not > > in backing device. I think you should judge it by backing device (remove > > PTR_CACHE() and use KEY_OFFSET() instead of PTR_OFFSET()?). > > > > 2) I did not see you combine samll contiguous I/Os to big I/O, so I think > > it is useful when writeback rate was low by avoiding single I/O write, but > > have no sense in high writeback rate, since previously it is also write > > I/Os asynchronously. > > > > ----------- > > Tang Junhui > > > >> Previously, there was some logic that attempted to immediately issue > >> writeback of backing-contiguous blocks when the writeback rate was > >> fast. > >> > >> The previous logic did not have any limits on the aggregate size it > >> would issue, nor the number of keys it would combine at once. It > >> would also discard the chance to do a contiguous write when the > >> writeback rate was low-- e.g. at "background" writeback of target > >> rate = 8, it would not combine two adjacent 4k writes and would > >> instead seek the disk twice. > >> > >> This patch imposes limits and explicitly understands the size of > >> contiguous I/O during issue. It also will combine contiguous I/O > >> in all circumstances, not just when writeback is requested to be > >> relatively fast. > >> > >> It is a win on its own, but also lays the groundwork for skip writes to > >> short keys to make the I/O more sequential/contiguous. > >> > >> Signed-off-by: Michael Lyle <mlyle@xxxxxxxx> > >> --- > >> drivers/md/bcache/bcache.h | 6 -- > >> drivers/md/bcache/writeback.c | 131 ++++++++++++++++++++++++++++++------------ > >> 2 files changed, 93 insertions(+), 44 deletions(-) > >> > >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h > >> index eb83be693d60..da803a3b1981 100644 > >> --- a/drivers/md/bcache/bcache.h > >> +++ b/drivers/md/bcache/bcache.h > >> @@ -321,12 +321,6 @@ struct cached_dev { > >> struct bch_ratelimit writeback_rate; > >> struct delayed_work writeback_rate_update; > >> > >> - /* > >> - * Internal to the writeback code, so read_dirty() can keep track of > >> - * where it's at. > >> - */ > >> - sector_t last_read; > >> - > >> /* Limit number of writeback bios in flight */ > >> struct semaphore in_flight; > >> struct task_struct *writeback_thread; > >> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c > >> index 0b7c89813635..cf29c44605b9 100644 > >> --- a/drivers/md/bcache/writeback.c > >> +++ b/drivers/md/bcache/writeback.c > >> @@ -229,10 +229,26 @@ static void read_dirty_submit(struct closure *cl) > >> continue_at(cl, write_dirty, io->dc->writeback_write_wq); > >> } > >> > >> +static inline bool keys_contiguous(struct cached_dev *dc, > >> + struct keybuf_key *first, struct keybuf_key *second) > >> +{ > >> + if (PTR_CACHE(dc->disk.c, &second->key, 0)->bdev != > >> + PTR_CACHE(dc->disk.c, &first->key, 0)->bdev) > >> + return false; > >> + > >> + if (PTR_OFFSET(&second->key, 0) != > >> + (PTR_OFFSET(&first->key, 0) + KEY_SIZE(&first->key))) > >> + return false; > >> + > >> + return true; > >> +} > >> + > >> static void read_dirty(struct cached_dev *dc) > >> { > >> unsigned delay = 0; > >> - struct keybuf_key *w; > >> + struct keybuf_key *next, *keys[5], *w; > >> + size_t size; > >> + int nk, i; > >> struct dirty_io *io; > >> struct closure cl; > >> > >> @@ -243,45 +259,84 @@ static void read_dirty(struct cached_dev *dc) > >> * mempools. > >> */ > >> > >> - while (!kthread_should_stop()) { > >> - > >> - w = bch_keybuf_next(&dc->writeback_keys); > >> - if (!w) > >> - break; > >> - > >> - BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); > >> - > >> - if (KEY_START(&w->key) != dc->last_read || > >> - jiffies_to_msecs(delay) > 50) > >> - while (!kthread_should_stop() && delay) > >> - delay = schedule_timeout_interruptible(delay); > >> - > >> - dc->last_read = KEY_OFFSET(&w->key); > >> - > >> - io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec) > >> - * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS), > >> - GFP_KERNEL); > >> - if (!io) > >> - goto err; > >> - > >> - w->private = io; > >> - io->dc = dc; > >> - > >> - dirty_init(w); > >> - bio_set_op_attrs(&io->bio, REQ_OP_READ, 0); > >> - io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0); > >> - bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c, &w->key, 0)->bdev); > >> - io->bio.bi_end_io = read_dirty_endio; > >> - > >> - if (bio_alloc_pages(&io->bio, GFP_KERNEL)) > >> - goto err_free; > >> - > >> - trace_bcache_writeback(&w->key); > >> + next = bch_keybuf_next(&dc->writeback_keys); > >> + > >> + while (!kthread_should_stop() && next) { > >> + size = 0; > >> + nk = 0; > >> + > >> + do { > >> + BUG_ON(ptr_stale(dc->disk.c, &next->key, 0)); > >> + > >> + /* Don't combine too many operations, even if they > >> + * are all small. > >> + */ > >> + if (nk >= 5) > >> + break; > >> + > >> + /* If the current operation is very large, don't > >> + * further combine operations. > >> + */ > >> + if (size > 5000) > >> + break; > >> + > >> + /* Operations are only eligible to be combined > >> + * if they are contiguous. > >> + * > >> + * TODO: add a heuristic willing to fire a > >> + * certain amount of non-contiguous IO per pass, > >> + * so that we can benefit from backing device > >> + * command queueing. > >> + */ > >> + if (nk != 0 && !keys_contiguous(dc, keys[nk-1], next)) > >> + break; > >> + > >> + size += KEY_SIZE(&next->key); > >> + keys[nk++] = next; > >> + } while ((next = bch_keybuf_next(&dc->writeback_keys))); > >> + > >> + /* Now we have gathered a set of 1..5 keys to write back. */ > >> + > >> + for (i = 0; i < nk; i++) { > >> + w = keys[i]; > >> + > >> + io = kzalloc(sizeof(struct dirty_io) + > >> + sizeof(struct bio_vec) * > >> + DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS), > >> + GFP_KERNEL); > >> + if (!io) > >> + goto err; > >> + > >> + w->private = io; > >> + io->dc = dc; > >> + > >> + dirty_init(w); > >> + bio_set_op_attrs(&io->bio, REQ_OP_READ, 0); > >> + io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0); > >> + bio_set_dev(&io->bio, > >> + PTR_CACHE(dc->disk.c, &w->key, 0)->bdev); > >> + io->bio.bi_end_io = read_dirty_endio; > >> + > >> + if (bio_alloc_pages(&io->bio, GFP_KERNEL)) > >> + goto err_free; > >> + > >> + trace_bcache_writeback(&w->key); > >> + > >> + down(&dc->in_flight); > >> + > >> + /* We've acquired a semaphore for the maximum > >> + * simultaneous number of writebacks; from here > >> + * everything happens asynchronously. > >> + */ > >> + closure_call(&io->cl, read_dirty_submit, NULL, &cl); > >> + } > >> > >> - down(&dc->in_flight); > >> - closure_call(&io->cl, read_dirty_submit, NULL, &cl); > >> + delay = writeback_delay(dc, size); > >> > >> - delay = writeback_delay(dc, KEY_SIZE(&w->key)); > >> + while (!kthread_should_stop() && delay) { > >> + schedule_timeout_interruptible(delay); > >> + delay = writeback_delay(dc, 0); > >> + } > >> } > >> > >> if (0) { > >> -- > > --