Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

Michael Lyle <mlyle@xxxxxxxx> · Wed, 27 Sep 2017 00:47:59 -0700

Tang--

This is a first step towards further stuff I want to do--

1. I want to allow blk_plug-- but to do that, a request needs to know
there are subsequent contiguous requests after it when issuing the
write.  The new structure allows that.
2. I want to allow tuning to issue multiple requests and control IO
priorities for them, so that we can make use of queue depth on backing
devices.  The new code structure allows for inserting heuristics to do
that very easily.  When 4 operations are issued at a time, latency
doesn't suffer very much and throughput can be 30-40% higher.
3. There's a small change to the delay calculation that will allow the
actual inner delay controller to be improved to do lower rates for
laptops and allow backing disk spindown.
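
To make point 1 concrete, here is a rough userspace sketch of the
gathering step. It is not the kernel code: struct extent, gather_batch()
and more_follows() are names made up for illustration, and contiguity is
judged by the start sector on the backing device. The limits (5 keys,
5000 sectors) mirror the ones in the patch. The point is that once a
batch is gathered up front, each request in it can tell whether a
contiguous request follows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dirty key: an extent on the backing device. */
struct extent {
	uint64_t start;   /* first sector on the backing device */
	uint64_t sectors; /* length in sectors */
};

#define MAX_KEYS    5    /* never combine more than 5 keys */
#define MAX_SECTORS 5000 /* stop combining once the batch is this large */

/* True if b begins exactly where a ends on the backing device. */
static bool extents_contiguous(const struct extent *a, const struct extent *b)
{
	return b->start == a->start + a->sectors;
}

/*
 * Gather one batch starting at keys[pos].  Returns the number of keys
 * taken (0 only when pos is past the end) and stores the total batch
 * size, in sectors, in *size.
 */
static size_t gather_batch(const struct extent *keys, size_t nkeys,
			   size_t pos, uint64_t *size)
{
	size_t nk = 0;

	assert(size != NULL);
	*size = 0;

	while (pos + nk < nkeys) {
		/* Don't combine too many operations, even small ones. */
		if (nk >= MAX_KEYS)
			break;

		/* Don't add to a batch that is already very large. */
		if (*size > MAX_SECTORS)
			break;

		/* Only contiguous operations may be combined. */
		if (nk != 0 &&
		    !extents_contiguous(&keys[pos + nk - 1], &keys[pos + nk]))
			break;

		*size += keys[pos + nk].sectors;
		nk++;
	}

	return nk;
}

/* Request i of an nk-key batch has a contiguous successor iff it isn't last. */
static bool more_follows(size_t i, size_t nk)
{
	return i + 1 < nk;
}
```

With 8-sector extents starting at sectors 0, 8 and 16 followed by one at
sector 100, the first call takes three keys (24 sectors) and the next
call starts a new batch at sector 100. A caller would step pos forward
by the returned count and compute a single delay from the batch size.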

But I need more testing and more time to complete those, and there's
already a benefit with the current structure.

Mike

On Wed, Sep 27, 2017 at 12:32 AM,  <tang.junhui@xxxxxxxxxx> wrote:
> From: Tang Junhui <tang.junhui@xxxxxxxxxx>
>
> Hello Mike:
>
> For the second question, I think this modification is somewhat complex;
> can't we do something simpler to resolve it? I remember there were some
> patches trying to avoid too small a writeback rate. Coly, is there any
> progress now?
>
> -------
> Tang Junhui
>
>> Ah-- re #1 -- I was investigating earlier why not as much was combined
>> as I thought should be when idle.  This is surely a factor.  Thanks
>> for the catch-- KEY_OFFSET is correct.  I will fix and retest.
>>
>> (Under heavy load, the correct thing still happens, but not under
>> light or intermediate load).
>>
>> About #2-- I wanted to attain a bounded amount of "combining" of
>> operations.  If we have 5 4k extents in a row to dispatch, it seems
>> really wasteful to issue them as 5 IOs 60ms apart, which the existing
>> code would be willing to do-- I'd rather do a 20k write IO (basically
>> the same cost as a 4k write IO) and then sleep 300ms.  It is dependent
>> on the elevator/IO scheduler merging the requests.  At the same time,
>> I'd rather not combine a really large request.
>>
>> It would be really neat to blk_plug the backing device during the
>> write issuance, but that requires further work.
>>
>> Thanks
>>
>> Mike
>>
>> On Tue, Sep 26, 2017 at 11:51 PM,  <tang.junhui@xxxxxxxxxx> wrote:
>> > From: Tang Junhui <tang.junhui@xxxxxxxxxx>
>> >
>> > Hello Lyle:
>> >
>> > Two questions:
>> > 1) In keys_contiguous(), you judge whether I/O is contiguous on the cache
>> > device, not on the backing device. I think you should judge it by the
>> > backing device (remove PTR_CACHE() and use KEY_OFFSET() instead of
>> > PTR_OFFSET()?).
>> >
>> > 2) I did not see you combine small contiguous I/Os into a big I/O, so I
>> > think this is useful when the writeback rate is low, by avoiding single-I/O
>> > writes, but makes no sense at a high writeback rate, since previously it
>> > also issued write I/Os asynchronously.
>> >
>> > -----------
>> > Tang Junhui
>> >
>> >> Previously, there was some logic that attempted to immediately issue
>> >> writeback of backing-contiguous blocks when the writeback rate was
>> >> fast.
>> >>
>> >> The previous logic did not have any limits on the aggregate size it
>> >> would issue, nor the number of keys it would combine at once.  It
>> >> would also discard the chance to do a contiguous write when the
>> >> writeback rate was low-- e.g. at "background" writeback of target
>> >> rate = 8, it would not combine two adjacent 4k writes and would
>> >> instead seek the disk twice.
>> >>
>> >> This patch imposes limits and explicitly understands the size of
>> >> contiguous I/O during issue.  It also will combine contiguous I/O
>> >> in all circumstances, not just when writeback is requested to be
>> >> relatively fast.
>> >>
>> >> It is a win on its own, but also lays the groundwork for skip writes to
>> >> short keys to make the I/O more sequential/contiguous.
>> >>
>> >> Signed-off-by: Michael Lyle <mlyle@xxxxxxxx>
>> >> ---
>> >>  drivers/md/bcache/bcache.h    |   6 --
>> >>  drivers/md/bcache/writeback.c | 131 ++++++++++++++++++++++++++++++------------
>> >>  2 files changed, 93 insertions(+), 44 deletions(-)
>> >>
>> >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> >> index eb83be693d60..da803a3b1981 100644
>> >> --- a/drivers/md/bcache/bcache.h
>> >> +++ b/drivers/md/bcache/bcache.h
>> >> @@ -321,12 +321,6 @@ struct cached_dev {
>> >>                struct bch_ratelimit            writeback_rate;
>> >>                struct delayed_work             writeback_rate_update;
>> >>
>> >> -              /*
>> >> -               * Internal to the writeback code, so read_dirty() can keep track of
>> >> -               * where it's at.
>> >> -               */
>> >> -              sector_t                                last_read;
>> >> -
>> >>                /* Limit number of writeback bios in flight */
>> >>                struct semaphore                in_flight;
>> >>                struct task_struct              *writeback_thread;
>> >> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> >> index 0b7c89813635..cf29c44605b9 100644
>> >> --- a/drivers/md/bcache/writeback.c
>> >> +++ b/drivers/md/bcache/writeback.c
>> >> @@ -229,10 +229,26 @@ static void read_dirty_submit(struct closure *cl)
>> >>                continue_at(cl, write_dirty, io->dc->writeback_write_wq);
>> >>  }
>> >>
>> >> +static inline bool keys_contiguous(struct cached_dev *dc,
>> >> +                              struct keybuf_key *first, struct keybuf_key *second)
>> >> +{
>> >> +              if (PTR_CACHE(dc->disk.c, &second->key, 0)->bdev !=
>> >> +                                              PTR_CACHE(dc->disk.c, &first->key, 0)->bdev)
>> >> +                              return false;
>> >> +
>> >> +              if (PTR_OFFSET(&second->key, 0) !=
>> >> +                                              (PTR_OFFSET(&first->key, 0) + KEY_SIZE(&first->key)))
>> >> +                              return false;
>> >> +
>> >> +              return true;
>> >> +}
>> >> +
>> >>  static void read_dirty(struct cached_dev *dc)
>> >>  {
>> >>                unsigned delay = 0;
>> >> -              struct keybuf_key *w;
>> >> +              struct keybuf_key *next, *keys[5], *w;
>> >> +              size_t size;
>> >> +              int nk, i;
>> >>                struct dirty_io *io;
>> >>                struct closure cl;
>> >>
>> >> @@ -243,45 +259,84 @@ static void read_dirty(struct cached_dev *dc)
>> >>                 * mempools.
>> >>                 */
>> >>
>> >> -              while (!kthread_should_stop()) {
>> >> -
>> >> -                              w = bch_keybuf_next(&dc->writeback_keys);
>> >> -                              if (!w)
>> >> -                                              break;
>> >> -
>> >> -                              BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
>> >> -
>> >> -                              if (KEY_START(&w->key) != dc->last_read ||
>> >> -                                  jiffies_to_msecs(delay) > 50)
>> >> -                                              while (!kthread_should_stop() && delay)
>> >> -                                                              delay = schedule_timeout_interruptible(delay);
>> >> -
>> >> -                              dc->last_read           = KEY_OFFSET(&w->key);
>> >> -
>> >> -                              io = kzalloc(sizeof(struct dirty_io) + sizeof(struct bio_vec)
>> >> -                                                   * DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
>> >> -                                                   GFP_KERNEL);
>> >> -                              if (!io)
>> >> -                                              goto err;
>> >> -
>> >> -                              w->private              = io;
>> >> -                              io->dc                          = dc;
>> >> -
>> >> -                              dirty_init(w);
>> >> -                              bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
>> >> -                              io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
>> >> -                              bio_set_dev(&io->bio, PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
>> >> -                              io->bio.bi_end_io               = read_dirty_endio;
>> >> -
>> >> -                              if (bio_alloc_pages(&io->bio, GFP_KERNEL))
>> >> -                                              goto err_free;
>> >> -
>> >> -                              trace_bcache_writeback(&w->key);
>> >> +              next = bch_keybuf_next(&dc->writeback_keys);
>> >> +
>> >> +              while (!kthread_should_stop() && next) {
>> >> +                              size = 0;
>> >> +                              nk = 0;
>> >> +
>> >> +                              do {
>> >> +                                              BUG_ON(ptr_stale(dc->disk.c, &next->key, 0));
>> >> +
>> >> +                                              /* Don't combine too many operations, even if they
>> >> +                                               * are all small.
>> >> +                                               */
>> >> +                                              if (nk >= 5)
>> >> +                                                              break;
>> >> +
>> >> +                                              /* If the current operation is very large, don't
>> >> +                                               * further combine operations.
>> >> +                                               */
>> >> +                                              if (size > 5000)
>> >> +                                                              break;
>> >> +
>> >> +                                              /* Operations are only eligible to be combined
>> >> +                                               * if they are contiguous.
>> >> +                                               *
>> >> +                                               * TODO: add a heuristic willing to fire a
>> >> +                                               * certain amount of non-contiguous IO per pass,
>> >> +                                               * so that we can benefit from backing device
>> >> +                                               * command queueing.
>> >> +                                               */
>> >> +                                              if (nk != 0 && !keys_contiguous(dc, keys[nk-1], next))
>> >> +                                                              break;
>> >> +
>> >> +                                              size += KEY_SIZE(&next->key);
>> >> +                                              keys[nk++] = next;
>> >> +                              } while ((next = bch_keybuf_next(&dc->writeback_keys)));
>> >> +
>> >> +                              /* Now we have gathered a set of 1..5 keys to write back. */
>> >> +
>> >> +                              for (i = 0; i < nk; i++) {
>> >> +                                              w = keys[i];
>> >> +
>> >> +                                              io = kzalloc(sizeof(struct dirty_io) +
>> >> +                                                                   sizeof(struct bio_vec) *
>> >> +                                                                   DIV_ROUND_UP(KEY_SIZE(&w->key), PAGE_SECTORS),
>> >> +                                                                   GFP_KERNEL);
>> >> +                                              if (!io)
>> >> +                                                              goto err;
>> >> +
>> >> +                                              w->private              = io;
>> >> +                                              io->dc                          = dc;
>> >> +
>> >> +                                              dirty_init(w);
>> >> +                                              bio_set_op_attrs(&io->bio, REQ_OP_READ, 0);
>> >> +                                              io->bio.bi_iter.bi_sector = PTR_OFFSET(&w->key, 0);
>> >> +                                              bio_set_dev(&io->bio,
>> >> +                                                                  PTR_CACHE(dc->disk.c, &w->key, 0)->bdev);
>> >> +                                              io->bio.bi_end_io               = read_dirty_endio;
>> >> +
>> >> +                                              if (bio_alloc_pages(&io->bio, GFP_KERNEL))
>> >> +                                                              goto err_free;
>> >> +
>> >> +                                              trace_bcache_writeback(&w->key);
>> >> +
>> >> +                                              down(&dc->in_flight);
>> >> +
>> >> +                                              /* We've acquired a semaphore for the maximum
>> >> +                                               * simultaneous number of writebacks; from here
>> >> +                                               * everything happens asynchronously.
>> >> +                                               */
>> >> +                                              closure_call(&io->cl, read_dirty_submit, NULL, &cl);
>> >> +                              }
>> >>
>> >> -                              down(&dc->in_flight);
>> >> -                              closure_call(&io->cl, read_dirty_submit, NULL, &cl);
>> >> +                              delay = writeback_delay(dc, size);
>> >>
>> >> -                              delay = writeback_delay(dc, KEY_SIZE(&w->key));
>> >> +                              while (!kthread_should_stop() && delay) {
>> >> +                                              schedule_timeout_interruptible(delay);
>> >> +                                              delay = writeback_delay(dc, 0);
>> >> +                              }
>> >>                }
>> >>
>> >>                if (0) {
>> >> --
>> > --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html