On 2022/12/22 9:39 PM, Michal Koutný wrote:
Hello Jinke.
On Wed, Dec 21, 2022 at 06:42:46PM +0800, Jinke Han <hanjinke.666@xxxxxxxxxxxxx> wrote:
In our test, fio writes a 100g file sequentially with a 4k block size in
a container with a low bps limit configured (wbps=10M). More than 1200
ios were throttled in the blk-throtl queue, and the average throttle time
of each io was 140s. At the same time, saving a small file in vim was
blocked for almost 140s. Because vim issues an fsync, the sync ios of that
fsync are blocked behind a huge number of buffered write ios queued ahead
of them. This is also a priority inversion problem within one cgroup.
In the database scenario, things get really bad with blk-throttle enabled,
as fsync is called very often.
I'm trying to make sense of the numbers:
- at 10 MB/s, it's 0.4 ms per 4k block
- there are 1.2k throttled bios, which gives a waiting time of roughly
  0.5 s (~0.4 ms * 1200)
- you say that you observe a 280 times longer throttling time,
- that'd mean there should be 340k queued bios
- or a cumulative dispatch of ~1400 MB of data
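The arithmetic in the list above can be checked with a few lines of Python. This is only a sketch assuming decimal units (1 MB = 10^6 bytes). Rounding 0.48 s up to 0.5 s gives the 280x / 340k figures; without rounding it comes out near 292x and 350k bios. Either way the implied backlog is about 1400 MB, because queued bytes = wait time * dispatch rate, independent of bio size.

```python
# Back-of-the-envelope check of the estimate above (decimal units: 1 MB = 10**6 B).
bps = 10 * 10**6          # wbps limit: 10 MB/s
block = 4 * 10**3         # one 4k block
observed_wait = 140.0     # reported average throttle time, seconds

per_block = block / bps                   # time to dispatch one 4k bio
wait_1200 = per_block * 1200              # wait implied by 1200 queued 4k bios
ratio = observed_wait / wait_1200         # how much longer the observed wait is
implied_bios = observed_wait / per_block  # 4k bios needed to explain 140 s
backlog_mb = observed_wait * bps / 10**6  # data queued ahead, independent of bio size

print(f"{per_block * 1e3:.1f} ms per 4k block")
print(f"{wait_1200:.2f} s for 1200 bios, {ratio:.0f}x less than observed")
print(f"{implied_bios:.0f} queued 4k bios, {backlog_mb:.0f} MB backlog")
```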
Hi
Device   r/s    w/s  rMB/s  wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sdb     0.00  11.00   0.00   8.01    0.00    0.00   0.00   0.00     0.00     7.18    0.08      0.00    745.45   3.27   3.60
sdb     0.00   8.00   0.00   9.14    0.00    0.00   0.00   0.00     0.00     7.38    0.06      0.00   1170.00   2.62   2.10
sdb     0.00  16.00   0.00  12.02    0.00   12.00   0.00  42.86     0.00     7.25    0.12      0.00    769.25   2.06   3.30
sdb     0.00  11.00   0.00  10.91    0.00    1.00   0.00   8.33     0.00     6.82    0.07      0.00   1015.64   2.36   2.60
sdb     0.00  11.00   0.00   9.14    0.00    1.00   0.00   8.33     0.00     6.27    0.07      0.00    850.91   2.55   2.80
I used bcc to trace the time of each bio from submit_bio to
blk_mq_submit_bio and found the average was nearly 140s (tracing the fsync
duration with bcc also gave the same result).
According to the iostat output above, the average size of each io is
nearly 1M (wareq-sz), so I roughly estimate the number of queued bios as
140s * 10M/s / 1M = 1400.
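The estimate above can be reproduced from the five iostat samples. This is a sketch: wareq-sz is reported in kB, and because the device-level request size can include block-layer merging (the wrqm/s column), it is only a proxy for the size of the bios waiting in the throttle queue. The samples average about 910 kB per write request at about 9.8 MB/s, close to the 10M limit.

```python
# iostat samples from the output above: (wMB/s, w/s, wareq-sz in kB)
samples = [
    (8.01, 11.00, 745.45),
    (9.14, 8.00, 1170.00),
    (12.02, 16.00, 769.25),
    (10.91, 11.00, 1015.64),
    (9.14, 11.00, 850.91),
]

mean_wmb = sum(s[0] for s in samples) / len(samples)     # achieved write rate
mean_req_kb = sum(s[2] for s in samples) / len(samples)  # average write request size

# Rough queue-length estimate from the text: wait time * rate / bio size.
throttle_s = 140   # average throttle time measured with bcc
limit_mb = 10      # wbps limit, MB/s
bio_mb = 1         # assumed average bio size ("nearly 1M")
queued_bios = throttle_s * limit_mb / bio_mb

print(f"mean write rate {mean_wmb:.2f} MB/s, mean request {mean_req_kb:.0f} kB")
print(f"estimated queued bios: {queued_bios:.0f}")
```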
So what are the queued quantities? Are there more than 1200 bios or are
they bigger than the 4k you mention?
"fio writes a 100g file in sequential 4k blocksize"
Bios may be larger than 1M, as ext4 merges contiguous logical blocks into
one bio when the physical blocks are also contiguous.
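A simplified model of the merging described above, for illustration only: sequentially written 4 KiB blocks are appended to the current bio while their physical blocks stay adjacent. The function name, the 512-block (2 MiB) cap, and the block layout are hypothetical values, not ext4's actual limits, which depend on the kernel and device.

```python
def merge_into_bios(pblocks, max_blocks=512):
    """Group sequentially written 4 KiB blocks into bios (simplified model).

    `pblocks` lists the physical block of each logical block, in write order.
    A block joins the current bio only if it is physically adjacent to the
    bio's last block and the bio is below `max_blocks` (a hypothetical cap).
    Returns (first_physical_block, block_count) tuples.
    """
    bios = []
    for pb in pblocks:
        if bios:
            start, count = bios[-1]
            if pb == start + count and count < max_blocks:
                bios[-1] = (start, count + 1)  # extend the current bio
                continue
        bios.append((pb, 1))                   # start a new bio
    return bios

# Hypothetical layout: 1000 logically sequential blocks stored in two
# physically contiguous extents (600 blocks at 0, 400 blocks at 5000).
layout = list(range(0, 600)) + list(range(5000, 5400))
bios = merge_into_bios(layout)
print(bios)                      # [(0, 512), (512, 88), (5000, 400)]
print([n * 4 for _, n in bios])  # bio sizes in KiB: [2048, 352, 1600]
```

With this layout, two of the three bios exceed 1 MiB even though fio only ever wrote 4k at a time.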
Thanks for the clarification.
(I acknowledge the possible problem with a large population of async
writes delaying scarce sync writes.)
Michal
If the 0.4ms figure is taken from iostat, estimating the throttle time of
a bio as 0.4ms * 1200 may not work, because that 0.4ms is the duration of
a request from allocation to completion.
If the average bio size is 1M, dispatching one bio should take 1M / 10M/s =
100ms. The queue is FIFO, so the average throttle time is 100ms * 1400 = 140s.
Thanks.