Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time

Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> · Mon, 24 Feb 2025 20:03:32 +0800

Hi,

在 2025/02/24 16:56, Ming Lei 写道:
On Mon, Feb 24, 2025 at 03:03:18PM +0800, Yu Kuai wrote:
Hi, Ming!

在 2025/02/24 11:28, Ming Lei 写道:
throtl_trim_slice() returns immediately if throtl_slice_used()
is true.

And throtl_slice_used() checks jiffies in [start, end] via time_in_range(),
so if `start <= jiffies <= end', it still returns false.

Yes, I misread the code, by thinking throtl_slice_used() will return
true if the slice is still used. :(


BTW, throtl_trim_slice() looks like problematic:

-       if (bytes_trim <= 0 && io_trim <= 0)
+       if (bytes_trim <= 0 || io_trim <= 0 ||
+           tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
                  return;
That is exactly what my patch is doing, just taking deviation and
timeout into account, also U64_MAX limit has to be excluded.
Yes, perhaps you can add some comments in the last two conditions of
your patch.

Yes, we need to add comment on the check, how about the following words?

```

If actually rate doesn't match with expected rate, do not trim slice
otherwise the present rate control info is lost, we don't have chance
to compensate it in the following period of this slice any more.

So, I just give your patch a test, and result is 1.3s while 1s is
expected. While debuging, a new idea come up in mind. :)

How about keep at least one slice out of consideration from
throtl_trim_slice()? With following patch, the result is between
1.01-1.03s in my VM.

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..5207c85098a5 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -604,9 +604,12 @@ static inline void throtl_trim_slice(struct 
throtl_grp *tg, bool rw)

        time_elapsed = rounddown(jiffies - tg->slice_start[rw],
                                 tg->td->throtl_slice);
-       if (!time_elapsed)
+       /* don't trim slice until at least 2 slice is used */
+       if (time_elapsed < tg->td->throtl_slice * 2)
                return;

+       /* dispite the last slice, trim previous slice */
+       time_elapsed -= tg->td->throtl_slice;
        bytes_trim = calculate_bytes_allowed(tg_bps_limit(tg, rw),
                                             time_elapsed) +
                     tg->carryover_bytes[rw];


Add one deviation threshold since bio size is usually not divisible by
present rate, and bio dispatch should be done or nothing

Also limit max slice size for avoiding to extend the window too big.

```


I think maybe you're concerned about the case IO is
throttled by IOs and bytes_disp should really erase extra bytes that
does not reach bps_limit.


If you test the patch, it works just fine. My patch controls bytes_exp
<= 1.5 * disp, then throtl/001 can be completed in 1.5sec, and if it is
changed to 1.25 * disp, the test can be completed in 1.25sec.

With this fix, why do we have to play the complicated carryover
trick?


I understand your code now. carryover_bytes in this case is wrong, as
long as new slice is not started, and trim slice doesn't erase extra
bytes by mistake, throttle time should be accurate over time because
bytes_disp is accurate.

Yes.

More or less wait for handling single bio can just affect instantaneous rate,
but the algorithm is adaptive so it can adjust the following wait if the
slice isn't over.


And root cause really is throtl_trim_slice().

However, by code review, I think throtl_start_new_slice() should be
handled as well, the same as throtl_trim_slice(), if the old bio is
dispatched with one more jiffies wait time, issue a new bio in the same
jiffies will have the same problem. For example:

throtl_start_new_slice() is called when nothing is queued and the current
slice is used, so probably it is fine.

throtl_start_new_slice_with_credit() is called when dispatching one
bio, and it is still called when the current slice is used.


Caller do sync IO, with io size: (bps_limit / (HZ / throtl_slice) + 1),

This will cause wait for every single IO.

But once the IO is throttled because of the wait, throtl_start_new_slice()
won't succeed any more, meantime throtl_trim_slice() will try to fix the
rate control for the extra 1 jiffies.

So in-tree code with trim fix works just fine if consecutive IOs are
coming.

and caller will issue new IO when old IO is done. Then in this case,
each IO will start a new slice and wait for throtl_slice + 1 jiffies. I
believe this can be fixed as well so that BIOs after the first one will
only wait for throtl_silce jiffies.

I guess the above case only exists in theory when you submit new IO
after the old IO is dispatched immediately. Not sure this case is really
meaningful because submission period/latency is added/faked by user.

Yes, this case is just a theory, we don't need to care for now. :)

Thanks,
Kuai



Thanks,
Ming


.