I just fixed the problem you pointed out, and ran a test where I did

sync;fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=8k --iodepth=256 --size=25G --readwrite=randwrite --ramp_time=4

to generate a lot of 8k extents.  After letting things quiet down, writeback
looks like this:

$ iostat 1 | grep sdb
sdb               3.00         0.00        40.00          0         40
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               2.00         0.00        16.00          0         16
sdb               0.00         0.00         0.00          0          0
sdb               3.00         0.00        40.00          0         40
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               3.00         0.00        40.00          0         40

So this is now nicely grouping contiguous extents for writeback into less
frequent, mostly-40k chunks.

Mike

On Wed, Sep 27, 2017 at 12:47 AM, Michael Lyle <mlyle@xxxxxxxx> wrote:
> Tang--
>
> This is a first step towards further stuff I want to do--
>
> 1. I want to allow blk_plug-- but to do that, a request needs to know
> there are subsequent contiguous requests after it when issuing the
> write.  The new structure allows that.
> 2. I want to allow tuning to issue multiple requests and control IO
> priorities for them, so that we can make use of queue depth on backing
> devices.  The new code structure allows for inserting heuristics to do
> that very easily.  When 4 operations are issued at a time, latency
> doesn't suffer very much and throughput can be 30-40% higher.
> 3. There's a small change to the delay calculation that will allow the
> actual inner delay controller to be improved to do lower rates for
> laptops and allow backing disk spindown.
>
> But I need more testing and more time to complete those, and there's
> already a benefit with the current structure.
>
> Mike
>
> On Wed, Sep 27, 2017 at 12:32 AM, <tang.junhui@xxxxxxxxxx> wrote:
>> From: Tang Junhui <tang.junhui@xxxxxxxxxx>
>>
>> Hello Mike:
>>
>> For the second question, I think this modification is somewhat complex;
>> can't we do something simpler to resolve it? I remember there were some
>> patches trying to avoid too small a writeback rate. Coly, is there any
>> progress now?
>>
>> -------
>> Tang Junhui
>>
>>> Ah-- re #1 -- I was investigating earlier why not as much was combined
>>> as I thought should be when idle.  This is surely a factor.  Thanks
>>> for the catch-- KEY_OFFSET is correct.  I will fix and retest.
>>>
>>> (Under heavy load, the correct thing still happens, but not under
>>> light or intermediate load.)
>>>
>>> About #2-- I wanted to attain a bounded amount of "combining" of
>>> operations.  If we have 5 4k extents in a row to dispatch, it seems
>>> really wasteful to issue them as 5 IOs 60ms apart, which the existing
>>> code would be willing to do-- I'd rather do a 20k write IO (basically
>>> the same cost as a 4k write IO) and then sleep 300ms.  It is dependent
>>> on the elevator/IO scheduler merging the requests.  At the same time,
>>> I'd rather not combine into one really large request.
>>>
>>> It would be really neat to blk_plug the backing device during the
>>> write issuance, but that requires further work.
>>>
>>> Thanks
>>>
>>> Mike
>>>
>>> On Tue, Sep 26, 2017 at 11:51 PM, <tang.junhui@xxxxxxxxxx> wrote:
>>> > From: Tang Junhui <tang.junhui@xxxxxxxxxx>
>>> >
>>> > Hello Lyle:
>>> >
>>> > Two questions:
>>> > 1) In keys_contiguous(), you judge whether the I/O is contiguous on the
>>> > cache device, but not on the backing device. I think you should judge it
>>> > by the backing device (remove PTR_CACHE() and use KEY_OFFSET() instead
>>> > of PTR_OFFSET()?).
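For concreteness, a minimal sketch of the backing-device check being suggested
here, assuming the existing bcache key macros (KEY_OFFSET() is the end of the
extent on the backing device and KEY_START(k) is KEY_OFFSET(k) - KEY_SIZE(k)).
This is only one possible shape of the fix Mike says he will make and retest,
not the actual patch:

	/* Illustrative sketch, untested: compare backing-device offsets.
	 * All keys in dc->writeback_keys refer to the same backing device,
	 * so dc is no longer consulted; it is enough to check that the
	 * second extent starts where the first one ends.
	 */
	static inline bool keys_contiguous(struct cached_dev *dc,
			struct keybuf_key *first, struct keybuf_key *second)
	{
		return KEY_START(&second->key) == KEY_OFFSET(&first->key);
	}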
>>> >
>>> > 2) I did not see you combine small contiguous I/Os into a bigger I/O, so I
>>> > think this is useful when the writeback rate is low, by avoiding single-I/O
>>> > writes, but it makes little difference at a high writeback rate, since
>>> > previously the I/Os were also written asynchronously.
>>> >
>>> > -----------
>>> > Tang Junhui
>>> >
>>> >> Previously, there was some logic that attempted to immediately issue
>>> >> writeback of backing-contiguous blocks when the writeback rate was
>>> >> fast.
>>> >>
>>> >> The previous logic did not have any limits on the aggregate size it
>>> >> would issue, nor the number of keys it would combine at once.  It
>>> >> would also discard the chance to do a contiguous write when the
>>> >> writeback rate was low-- e.g. at "background" writeback of target
>>> >> rate = 8, it would not combine two adjacent 4k writes and would
>>> >> instead seek the disk twice.
>>> >>
>>> >> This patch imposes limits and explicitly understands the size of
>>> >> contiguous I/O during issue.  It also will combine contiguous I/O
>>> >> in all circumstances, not just when writeback is requested to be
>>> >> relatively fast.
>>> >>
>>> >> It is a win on its own, but also lays the groundwork for skipping writes
>>> >> to short keys to make the I/O more sequential/contiguous.
>>> >>
>>> >> Signed-off-by: Michael Lyle <mlyle@xxxxxxxx>
>>> >> ---
>>> >>  drivers/md/bcache/bcache.h    |   6 --
>>> >>  drivers/md/bcache/writeback.c | 131 ++++++++++++++++++++++++++++++------------
>>> >>  2 files changed, 93 insertions(+), 44 deletions(-)
>>> >>
>>> >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>>> >> index eb83be693d60..da803a3b1981 100644
>>> >> --- a/drivers/md/bcache/bcache.h
>>> >> +++ b/drivers/md/bcache/bcache.h
>>> >> @@ -321,12 +321,6 @@ struct cached_dev {
>>> >>  	struct bch_ratelimit	writeback_rate;
>>> >>  	struct delayed_work	writeback_rate_update;
>>> >>
>>> >> -	/*
>>> >> -	 * Internal to the writeback code, so read_dirty() can keep track of
>>> >> -	 * where it's at.
>>> >> -	 */
>>> >> -	sector_t		last_read;
>>> >> -
>>> >>  	/* Limit number of writeback bios in flight */
>>> >>  	struct semaphore	in_flight;
>>> >>  	struct task_struct	*writeback_thread;
>>> >> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>>> >> index 0b7c89813635..cf29c44605b9 100644
>>> >> --- a/drivers/md/bcache/writeback.c
>>> >> +++ b/drivers/md/bcache/writeback.c
>>> >> @@ -229,10 +229,26 @@ static void read_dirty_submit(struct closure *cl)
>>> >>  	continue_at(cl, write_dirty, io->dc->writeback_write_wq);
>>> >>  }
>>> >>
>>> >> +static inline bool keys_contiguous(struct cached_dev *dc,
>>> >> +		struct keybuf_key *first, struct keybuf_key *second)
>>> >> +{
>>> >> +	if (PTR_CACHE(dc->disk.c, &second->key, 0)->bdev !=
>>> >> +	    PTR_CACHE(dc->disk.c, &first->key, 0)->bdev)
>>> >> +		return false;
>>> >> +
>>> >> +	if (PTR_OFFSET(&second->key, 0) !=
>>> >> +	    (PTR_OFFSET(&first->key, 0) + KEY_SIZE(&first->key)))
>>> >> +		return false;
>>> >> +
>>> >> +	return true;
>>> >> +}
>>> >> +
>>> >>  static void read_dirty(struct cached_dev *dc)
>>> >>  {
>>> >>  	unsigned delay = 0;
>>> >> -	struct keybuf_key *w;
>>> >> +	struct keybuf_key *next, *keys[5], *w;
>>> >> +	size_t size;
>>> >> +	int nk, i;
>>> >>  	struct dirty_io *io;
>>> >>  	struct closure cl;
>>> >>
>>> >> @@ -243,45 +259,84 @@ static void read_dirty(struct cached_dev *dc)
>>> >>  	 * mempools.
>>> >>  	 */
>>> >> -	while (!kthread_should_stop()) {
>>> >> -
>>> >> -		w = bch_keybuf_next(&dc->writeback_keys);
>>> >> -		if (!w)
>>> >> -			break;
>>> >> -
>>> >> -		BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
>>> >> -
>>> >> -		if (KEY_START(&w->key) != dc->last_read ||
>>> >> -		    jiffies_to_msecs(delay) > 50)
>>> >> -			while (!kthread_should_stop() && delay)
>>> >> -				delay = schedule_timeout_interruptible(delay);
>>> >> -
>>> >> -		dc->last_read	= KEY_OFFSET(&w->key);
>>> >> -
>>> >> -		io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
>>> >> -			     * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
>>> >> -			     GFP_KERNEL);
>>> >> -		if (!io)
>>> >> -			goto err;
>>> >> -
>>> >> -		w->private	= io;
>>> >> -		io->dc		= dc;
>>> >> -
>>> >> -		dirty_init(w);
>>> >> -		bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
>>> >> -		io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
>>> >> -		bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
>>> >> -		io->bio.bi_end_io	= read_dirty_endio;
>>> >> -
>>> >> -		if (bio_alloc_pages(&io->bio, GFP_KERNEL))
>>> >> -			goto err_free;
>>> >> -
>>> >> -		trace_bcache_writeback(&w->key);
>>> >> +	next = bch_keybuf_next(&dc->writeback_keys);
>>> >> +
>>> >> +	while (!kthread_should_stop() && next) {
>>> >> +		size = 0;
>>> >> +		nk = 0;
>>> >> +
>>> >> +		do {
>>> >> +			BUG_ON(ptr_stale(dc->disk.c, &next->key, 0));
>>> >> +
>>> >> +			/* Don't combine too many operations, even if they
>>> >> +			 * are all small.
>>> >> +			 */
>>> >> +			if (nk >= 5)
>>> >> +				break;
>>> >> +
>>> >> +			/* If the current operation is very large, don't
>>> >> +			 * further combine operations.
>>> >> +			 */
>>> >> +			if (size > 5000)
>>> >> +				break;
>>> >> +
>>> >> +			/* Operations are only eligible to be combined
>>> >> +			 * if they are contiguous.
>>> >> +			 *
>>> >> +			 * TODO: add a heuristic willing to fire a
>>> >> +			 * certain amount of non-contiguous IO per pass,
>>> >> +			 * so that we can benefit from backing device
>>> >> +			 * command queueing.
>>> >> +			 */
>>> >> +			if (nk != 0 && !keys_contiguous(dc, keys[nk-1], next))
>>> >> +				break;
>>> >> +
>>> >> +			size += KEY_SIZE(&next->key);
>>> >> +			keys[nk++] = next;
>>> >> +		} while ((next = bch_keybuf_next(&dc->writeback_keys)));
>>> >> +
>>> >> +		/* Now we have gathered a set of 1..5 keys to write back. */
>>> >> +
>>> >> +		for (i = 0; i < nk; i++) {
>>> >> +			w = keys[i];
>>> >> +
>>> >> +			io = kzalloc(sizeof(struct dirty_io) +
>>> >> +				     sizeof(struct bio_vec) *
>>> >> +				     DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
>>> >> +				     GFP_KERNEL);
>>> >> +			if (!io)
>>> >> +				goto err;
>>> >> +
>>> >> +			w->private	= io;
>>> >> +			io->dc		= dc;
>>> >> +
>>> >> +			dirty_init(w);
>>> >> +			bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
>>> >> +			io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
>>> >> +			bio_set_dev(&io->bio,
>>> >> +				    PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
>>> >> +			io->bio.bi_end_io	= read_dirty_endio;
>>> >> +
>>> >> +			if (bio_alloc_pages(&io->bio, GFP_KERNEL))
>>> >> +				goto err_free;
>>> >> +
>>> >> +			trace_bcache_writeback(&w->key);
>>> >> +
>>> >> +			down(&dc->in_flight);
>>> >> +
>>> >> +			/* We've acquired a semaphore for the maximum
>>> >> +			 * simultaneous number of writebacks; from here
>>> >> +			 * everything happens asynchronously.
>>> >> +			 */
>>> >> +			closure_call(&io->cl, read_dirty_submit, NULL, &cl);
>>> >> +		}
>>> >>
>>> >> -		down(&dc->in_flight);
>>> >> -		closure_call(&io->cl, read_dirty_submit, NULL, &cl);
>>> >> +		delay = writeback_delay(dc, size);
>>> >>
>>> >> -		delay = writeback_delay(dc, KEY_SIZE(&w->key));
>>> >> +		while (!kthread_should_stop() && delay) {
>>> >> +			schedule_timeout_interruptible(delay);
>>> >> +			delay = writeback_delay(dc, 0);
>>> >> +		}
>>> >>  	}
>>> >>
>>> >>  	if (0) {
>>> >> --
>>> > --
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
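For readers who want the shape of the new read_dirty() gathering step without
the diff context, here is a condensed, userspace-style sketch.  The struct
wb_key type and the gather_run()/issue_writeback()/sleep_proportional_to()
names are hypothetical simplifications; the real patch operates on struct
keybuf_key from bch_keybuf_next(), and the bounds of 5 keys and 5000 sectors
per pass are taken directly from the patch above.

	#include <stdbool.h>
	#include <stddef.h>

	#define MAX_WRITEBACKS_IN_PASS	5	/* cap on keys combined per pass */
	#define MAX_WRITESIZE_IN_PASS	5000	/* cap on total sectors per pass */

	/* Hypothetical stand-in for a dirty extent queued for writeback. */
	struct wb_key {
		unsigned long start;	/* start sector on the backing device */
		unsigned long sectors;	/* extent size in sectors */
	};

	/* True if b immediately follows a on the backing device. */
	static bool contiguous(const struct wb_key *a, const struct wb_key *b)
	{
		return b->start == a->start + a->sectors;
	}

	/*
	 * Gather a bounded run of contiguous keys starting at keys[*pos].
	 * Advances *pos past the run and returns its length; the combined
	 * size in sectors is written to *size so the caller can issue the
	 * whole run back to back and then sleep once, in proportion to that
	 * size (the patch does this with writeback_delay(dc, size)).
	 */
	static size_t gather_run(const struct wb_key *keys, size_t nkeys,
				 size_t *pos, unsigned long *size)
	{
		size_t nk = 0;

		*size = 0;
		while (*pos < nkeys &&
		       nk < MAX_WRITEBACKS_IN_PASS &&
		       *size <= MAX_WRITESIZE_IN_PASS) {
			if (nk && !contiguous(&keys[*pos - 1], &keys[*pos]))
				break;
			*size += keys[*pos].sectors;
			nk++;
			(*pos)++;
		}
		return nk;
	}

	/*
	 * Typical use (helper names hypothetical):
	 *
	 *	size_t pos = 0, nk;
	 *	unsigned long size;
	 *
	 *	while ((nk = gather_run(keys, nkeys, &pos, &size)) != 0) {
	 *		issue_writeback(&keys[pos - nk], nk);	// back to back
	 *		sleep_proportional_to(size);		// one delay per run
	 *	}
	 */

In the patch itself, each key in the run still gets its own dirty_io and its
own read from the cache device, but the delay is computed once per run from
the combined size, which is how several adjacent small extents end up as one
back-to-back burst followed by a single, longer sleep.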