Re: [PATCH v5] blk-throtl: Introduce sync and async queues for blk-throtl

Tejun Heo <tj@xxxxxxxxxx> · Wed, 11 Jan 2023 07:05:25 -1000

On Thu, Jan 12, 2023 at 12:20:30AM +0800, Jinke Han wrote:
> From: Jinke Han <hanjinke.666@xxxxxxxxxxxxx>
> 
> Now we don't distinguish sync write ios from normal buffer write ios
> in blk-throtl. A bio with REQ_SYNC tagged always mean it will be wait
> until write completion soon after it submit. So it's reasonable for sync
> io to complete as soon as possible.
> 
> In our test, fio writes a 100g file in sequential 4k blocksize in
> a container with low bps limit configured (wbps=10M). More than 1200
> ios were throttled in blk-throtl queue and the avarage throtle time
> of each io is 140s. At the same time, the operation of saving a small
> file by vim will be blocked amolst 140s. As a fsync will be send by vim,
> the sync ios of fsync will be blocked by a huge amount of buffer write
> ios ahead. This is also a priority inversion problem within one cgroup.
> In the database scene, things got really bad with blk-throtle enabled
> as fsync is called very often.
> 
> This patch splits bio queue into sync and async queues for blk-throtl
> and gives a huge priority to sync write ios. Sync queue only make sense
> for write ios as we treat all read io as sync io. I think it's a nice
> respond to the semantics of REQ_SYNC. Bios with REQ_META and REQ_PRIO
> gains the same priority as they are important to fs. This may avoid
> some potential priority inversion problems.
> 
> With this patch, do the same test above, the duration of the fsync sent
> by vim drops to several hundreds of milliseconds.
> 
> Signed-off-by: Jinke Han <hanjinke.666@xxxxxxxxxxxxx>
> Signed-off-by: Xin Yin <yinxin.x@xxxxxxxxxxxxx>

Acked-by: Tejun Heo <tj@xxxxxxxxxx>

with some nits below:

> +/**
> + * throtl_qnode_bio_peek - peek a bio from a qn
> + * @qn: the qnode to peek from
> + *
> + * For read, always peek bio from the SYNC queue.
> + *
> + * For write, we always peek bio from next_to_disp. If it's NULL, a bio
                    ^
                    first 

> + * will be popped from SYNC or ASYNC queue to fill it. The next_to_disp
> + * is used to make sure that the peeked bio and the next popped bio are
                                   ^
                                   previously

> + * always the same even in case that the spinlock of queue was released
> + * and re-holded.
          ^
          re-grabbed / re-acquired
> + *
> + * Without the next_to_disp, consider the following situation:
      ^^^^^^^^^^^^^^^^^^^^^^^^^^
      maybe drop this part and move the latter part to the end of the
      previous para?

> + * Assumed that there are only bios queued in ASYNC queue and the SYNC
      ^
      Assume

> + * queue is empty and all ASYNC bios are 1M in size and the bps limit is
> + * 1M/s. The throtl_slice is 100ms. The current slice is [jiffies1,
> + * jiffies1+100] and the bytes_disp[w] is 0.
> + *
> + * The disp_sync_cnt is 0 as it was set 0 after each dispatching of a
> + * ASYNC bio. A ASYNC bio wil be peeked to check in tg_may_dispatch.
> + * Obviously, it can't be dispatched in current slice and the wait time
> + * is 900ms. The slice will be extended to [jiffies1, jiffies1+1000] in
> + * tg_may_dispatch. The spinlock of the queue will be released after the
> + * process of dispatch giving up. A 4k size SYNC bio was queued in and
> + * the SYNC queue becomes no-empty. After 900ms, it's time to dispatch
> + * the tg, the SYNC bio will be popped to dispatched as the disp_sync_cnt
> + * is 0 and the SYNC queue is no-empty. The slice will be extended to
      ^
 Maybe combine the previous several sentences like:

 The queue lock is released and a 4k SYNC bio gets queued during the 900ms
 wait.

> + * [jiffies1, jiffies1+1100] in tg_may_dispatch. Then the slice will be
> + * trimed to [jiffies1+1000, jiffies1+1100] after the SYNC bio was
> + * dispatched. Then the former 1M size ASYNC bio will be peeked to be
> + * checked and still can't be dispatched because of overlimit within
> + * the current slice. The same thing may happen DISPACH_SYNC_FACTOR times
> + * if always there is a SYNC bio be queued in the SYNC queue when the
> + * ASYNC bio is waiting. This means that in nearly 5s, we have dispathed
> + * four 4k SYNC bios and one  1M ASYNC bio. It is hard to fill up the
> + * bandwidth considering that the bps limit is 1M/s.

Simiarly I think the information can be conveyed in a more compact form.

Thanks.

-- 
tejun