Hi,
在 2024/08/13 14:39, Yu Kuai 写道:
Hi,
在 2024/08/13 13:00, Lance Yang 写道:
Hi Kuai,
Thanks a lot for jumping in!
On Tue, Aug 13, 2024 at 9:37 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
在 2024/08/12 23:43, Michal Koutný 写道:
+Cc Kuai
On Mon, Aug 12, 2024 at 11:00:30PM GMT, Lance Yang
<ioworker0@xxxxxxxxx> wrote:
Hi all,
I've run into a problem with Cgroup v2 where it doesn't seem to
correctly limit
I/O operations when I set both wbps and wiops for a device.
However, if I only
set wbps, then everything works as expected.
To reproduce the problem, we can follow these command-based steps:
1. **System Information:**
- Kernel Version and OS Release:
```
$ uname -r
6.10.0-rc5+
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
```
2. **Device Information and Settings:**
- List Block Devices and Scheduler:
```
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 4.4T 0 disk
└─sda1 8:1 0 4.4T 0 part /data
...
$ cat /sys/block/sda/queue/scheduler
none [mq-deadline] kyber bfq
$ cat /sys/block/sda/queue/rotational
1
```
3. **Reproducing the problem:**
- Navigate to the cgroup v2 filesystem and configure I/O
settings:
```
$ cd /sys/fs/cgroup/
$ stat -fc %T /sys/fs/cgroup
cgroup2fs
$ mkdir test
$ echo "8:0 wbps=10485760 wiops=100000" > io.max
```
In this setup:
wbps=10485760 sets the write bytes per second limit to 10 MB/s.
wiops=100000 sets the write I/O operations per second limit
to 100,000.
- Add process to the cgroup and verify:
```
$ echo $$ > cgroup.procs
$ cat cgroup.procs
3826771
3828513
$ ps -ef|grep 3826771
root 3826771 3826768 0 22:04 pts/1 00:00:00 -bash
root 3828761 3826771 0 22:06 pts/1 00:00:00 ps -ef
root 3828762 3826771 0 22:06 pts/1 00:00:00 grep
--color=auto 3826771
```
- Observe I/O performance using `dd` commands and `iostat`:
```
$ dd if=/dev/zero of=/data/file1 bs=512M count=1 &
$ dd if=/dev/zero of=/data/file1 bs=512M count=1 &
You're testing buffer IO here, and I don't see that write back cgroup is
enabled. Is this test intentional? Why not test direct IO?
Yes, I was testing buffered I/O and can confirm that
CONFIG_CGROUP_WRITEBACK
was enabled.
$ cat /boot/config-6.10.0-rc5+ |grep CONFIG_CGROUP_WRITEBACK
CONFIG_CGROUP_WRITEBACK=y
We intend to configure both wbps (write bytes per second) and wiops
(write I/O operations
per second) for the containers. IIUC, this setup will effectively
restrict both their block device
I/Os and buffered I/Os.
Why not test direct IO?
I was testing direct IO as well. However it did not work as expected with
`echo "8:0 wbps=10485760 wiops=100000" > io.max`.
$ time dd if=/dev/zero of=/data/file7 bs=512M count=1 oflag=direct
So, you're issuing one huge IO, with 512M.
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 51.5962 s, 10.4 MB/s
And this result looks correct. Please noted that blk-throtl works before
IO submit, while iostat reports IO that are done. A huge IO can be
throttled for a long time.
real 0m51.637s
user 0m0.000s
sys 0m0.313s
$ iostat -d 1 -h -y -p sda
tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn
kB_dscd Device
9.00 0.0k 1.3M 0.0k 0.0k 1.3M
0.0k sda
9.00 0.0k 1.3M 0.0k 0.0k 1.3M
0.0k sda1
I don't understand yet is why there are few IO during the wait. Can you
test for a raw disk to bypass filesystem?
To be updated, I add a debug patch for this:
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index dc6140fa3de0..3b2648c17079 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1119,8 +1119,10 @@ static void blk_throtl_dispatch_work_fn(struct
work_struct *work)
if (!bio_list_empty(&bio_list_on_stack)) {
blk_start_plug(&plug);
- while ((bio = bio_list_pop(&bio_list_on_stack)))
+ while ((bio = bio_list_pop(&bio_list_on_stack))) {
+ printk("%s: bio done %lu %px\n", __func__,
bio_sectors(bio), bio);
submit_bio_noacct_nocheck(bio);
+ }
blk_finish_plug(&plug);
}
}
@@ -1606,6 +1608,8 @@ bool __blk_throtl_bio(struct bio *bio)
bool throttled = false;
struct throtl_data *td = tg->td;
+ printk("%s: bio start %lu %px\n", __func__, bio_sectors(bio), bio);
+
rcu_read_lock();
spin_lock_irq(&q->queue_lock);
sq = &tg->service_queue;
@@ -1649,6 +1653,7 @@ bool __blk_throtl_bio(struct bio *bio)
tg = sq_to_tg(sq);
if (!tg) {
bio_set_flag(bio, BIO_BPS_THROTTLED);
+ printk("%s: bio done %lu %px\n", __func__,
bio_sectors(bio), bio);
goto out_unlock;
}
}
For dirct IO with raw disk:
with or without wiops, the result is the same:
[ 469.736098] __blk_throtl_bio: bio start 2128 ffff8881014c08c0
[ 469.736903] __blk_throtl_bio: bio start 2144 ffff88817852ec80
[ 469.737585] __blk_throtl_bio: bio start 2096 ffff88817852f080
[ 469.738392] __blk_throtl_bio: bio start 2096 ffff88817852f480
[ 469.739358] __blk_throtl_bio: bio start 2064 ffff88817852e880
[ 469.740330] __blk_throtl_bio: bio start 2112 ffff88817852fa80
[ 469.741262] __blk_throtl_bio: bio start 2080 ffff88817852e280
[ 469.742280] __blk_throtl_bio: bio start 2096 ffff88817852e080
[ 469.743281] __blk_throtl_bio: bio start 2104 ffff88817852f880
[ 469.744309] __blk_throtl_bio: bio start 2240 ffff88817852e680
[ 469.745050] __blk_throtl_bio: bio start 2184 ffff88817852e480
[ 469.745857] __blk_throtl_bio: bio start 2120 ffff88817852f680
[ 469.746779] __blk_throtl_bio: bio start 2512 ffff88817852fe80
[ 469.747611] __blk_throtl_bio: bio start 2488 ffff88817852f280
[ 469.748242] __blk_throtl_bio: bio start 2120 ffff88817852ee80
[ 469.749159] __blk_throtl_bio: bio start 2256 ffff88817852fc80
[ 469.750087] __blk_throtl_bio: bio start 2576 ffff88817852ea80
[ 469.750802] __blk_throtl_bio: bio start 2112 ffff8881014a3a80
[ 469.751586] __blk_throtl_bio: bio start 2240 ffff8881014a2880
[ 469.752383] __blk_throtl_bio: bio start 2160 ffff8881014a2e80
[ 469.753289] __blk_throtl_bio: bio start 2248 ffff8881014a3c80
[ 469.754024] __blk_throtl_bio: bio start 2536 ffff8881014a2680
[ 469.754913] __blk_throtl_bio: bio start 2088 ffff8881014a3080
[ 469.766036] __blk_throtl_bio: bio start 211344 ffff8881014a3280
[ 469.842366] blk_throtl_dispatch_work_fn: bio done 2128 ffff8881014c08c0
[ 469.952627] blk_throtl_dispatch_work_fn: bio done 2144 ffff88817852ec80
[ 470.048729] blk_throtl_dispatch_work_fn: bio done 2096 ffff88817852f080
[ 470.152642] blk_throtl_dispatch_work_fn: bio done 2096 ffff88817852f480
[ 470.256661] blk_throtl_dispatch_work_fn: bio done 2064 ffff88817852e880
[ 470.360662] blk_throtl_dispatch_work_fn: bio done 2112 ffff88817852fa80
[ 470.464626] blk_throtl_dispatch_work_fn: bio done 2080 ffff88817852e280
[ 470.568652] blk_throtl_dispatch_work_fn: bio done 2096 ffff88817852e080
[ 470.672623] blk_throtl_dispatch_work_fn: bio done 2104 ffff88817852f880
[ 470.776620] blk_throtl_dispatch_work_fn: bio done 2240 ffff88817852e680
[ 470.889801] blk_throtl_dispatch_work_fn: bio done 2184 ffff88817852e480
[ 470.992686] blk_throtl_dispatch_work_fn: bio done 2120 ffff88817852f680
[ 471.112633] blk_throtl_dispatch_work_fn: bio done 2512 ffff88817852fe80
[ 471.232680] blk_throtl_dispatch_work_fn: bio done 2488 ffff88817852f280
[ 471.336695] blk_throtl_dispatch_work_fn: bio done 2120 ffff88817852ee80
[ 471.448645] blk_throtl_dispatch_work_fn: bio done 2256 ffff88817852fc80
[ 471.576632] blk_throtl_dispatch_work_fn: bio done 2576 ffff88817852ea80
[ 471.680709] blk_throtl_dispatch_work_fn: bio done 2112 ffff8881014a3a80
[ 471.792680] blk_throtl_dispatch_work_fn: bio done 2240 ffff8881014a2880
[ 471.896682] blk_throtl_dispatch_work_fn: bio done 2160 ffff8881014a2e80
[ 472.008698] blk_throtl_dispatch_work_fn: bio done 2248 ffff8881014a3c80
[ 472.136630] blk_throtl_dispatch_work_fn: bio done 2536 ffff8881014a2680
[ 472.240678] blk_throtl_dispatch_work_fn: bio done 2088 ffff8881014a3080
[ 482.560633] blk_throtl_dispatch_work_fn: bio done 211344 ffff8881014a3280
Hence the upper layer issue some small IO first, then with a 100+MB IO,
and wait time looks correct.
Then, I retest for xfs, result are still the same with or without wiops:
[ 1175.907019] __blk_throtl_bio: bio start 8192 ffff88816daf8480
[ 1175.908224] __blk_throtl_bio: bio start 8192 ffff88816daf8e80
[ 1175.910618] __blk_throtl_bio: bio start 8192 ffff88816daf9280
[ 1175.911991] __blk_throtl_bio: bio start 8192 ffff88816daf8280
[ 1175.913187] __blk_throtl_bio: bio start 8192 ffff88816daf9080
[ 1175.914904] __blk_throtl_bio: bio start 8192 ffff88816daf9680
[ 1175.916099] __blk_throtl_bio: bio start 8192 ffff88816daf8880
[ 1175.917844] __blk_throtl_bio: bio start 8192 ffff88816daf8c80
[ 1175.919025] __blk_throtl_bio: bio start 8192 ffff88816daf8a80
[ 1175.920868] __blk_throtl_bio: bio start 8192 ffff888178a84080
[ 1175.922068] __blk_throtl_bio: bio start 8192 ffff888178a84280
[ 1175.923819] __blk_throtl_bio: bio start 8192 ffff888178a84480
[ 1175.925017] __blk_throtl_bio: bio start 8192 ffff888178a84680
[ 1175.926851] __blk_throtl_bio: bio start 8192 ffff888178a84880
[ 1175.928025] __blk_throtl_bio: bio start 8192 ffff888178a84a80
[ 1175.929806] __blk_throtl_bio: bio start 8192 ffff888178a84c80
[ 1175.931007] __blk_throtl_bio: bio start 8192 ffff888178a84e80
[ 1175.932852] __blk_throtl_bio: bio start 8192 ffff888178a85080
[ 1175.934041] __blk_throtl_bio: bio start 8192 ffff888178a85280
[ 1175.935892] __blk_throtl_bio: bio start 8192 ffff888178a85480
[ 1175.937074] __blk_throtl_bio: bio start 8192 ffff888178a85680
[ 1175.938860] __blk_throtl_bio: bio start 8192 ffff888178a85880
[ 1175.940053] __blk_throtl_bio: bio start 8192 ffff888178a85a80
[ 1175.941824] __blk_throtl_bio: bio start 8192 ffff888178a85c80
[ 1175.943040] __blk_throtl_bio: bio start 8192 ffff888178a85e80
[ 1175.944945] __blk_throtl_bio: bio start 8192 ffff88816b046080
[ 1175.946156] __blk_throtl_bio: bio start 8192 ffff88816b046280
[ 1175.948261] __blk_throtl_bio: bio start 8192 ffff88816b046480
[ 1175.949521] __blk_throtl_bio: bio start 8192 ffff88816b046680
[ 1175.950877] __blk_throtl_bio: bio start 8192 ffff88816b046880
[ 1175.952051] __blk_throtl_bio: bio start 8192 ffff88816b046a80
[ 1175.954313] __blk_throtl_bio: bio start 8192 ffff88816b046c80
[ 1175.955530] __blk_throtl_bio: bio start 8192 ffff88816b046e80
[ 1175.957370] __blk_throtl_bio: bio start 8192 ffff88816b047080
[ 1175.958818] __blk_throtl_bio: bio start 8192 ffff88816b047280
[ 1175.960093] __blk_throtl_bio: bio start 8192 ffff88816b047480
[ 1175.961900] __blk_throtl_bio: bio start 8192 ffff88816b047680
[ 1175.963070] __blk_throtl_bio: bio start 8192 ffff88816b047880
[ 1175.965262] __blk_throtl_bio: bio start 8192 ffff88816b047a80
[ 1175.966527] __blk_throtl_bio: bio start 8192 ffff88816b047c80
[ 1175.967928] __blk_throtl_bio: bio start 8192 ffff88816b047e80
[ 1175.969124] __blk_throtl_bio: bio start 8192 ffff888170e84080
[ 1175.971369] __blk_throtl_bio: bio start 8192 ffff888170e84280
Hence xfs is always issuing 4MB IO, that's whay stable wbps can be
observed by iostat. The main difference is that a 100+MB IO is issued
from the last test and throttle for about 10+s.
Then for your case, you might want to comfirm what kind of IO are
submitted from upper layer.
Thanks,
Kuai
Thanks,
Kuai
.