Hi fsdevel,

I wanted to get your opinion on the following scenario where we are running into disk/IOPS perf issues because the dio kernel worker thread is being cpu starved. Please also let me know if you think there is a better place to ask this than this mailing list.

Originally we stumbled upon this when running Redpanda in Kubernetes, but the problem is fairly easy to reproduce with fio. k8s creates a cgroup (hierarchy) in which it spawns its pods. On an N-core system it assigns a cpu.weight of N times the default weight to the root of that hierarchy. Hence tasks running in that cgroup get about N times more runtime than other tasks (including kernel threads). When using direct io this can cause performance issues, as the dio worker that handles the dio completions gets excessively preempted and hence effectively falls behind.

Outside of a k8s environment we can reproduce it as follows. Set up a cgroup (assuming a cgroups v2 system):

--
cgcreate -g cpu:/kubepods
cgset -r cpu.weight=1000 kubepods
--

Now compare fio running outside of the cgroup and inside it. The test system is:

- Amazon Linux 2023 / Linux 6.1
- i3en.3xlarge instance / 200k IOPS @ 4K write
- XFS filesystem

Outside / good:

--
taskset -c 11 fio --name=write_iops --directory=/mnt/xfs --size=10G --time_based --runtime=1m --ramp_time=10s \
    --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=128 --rw=randwrite --group_reporting=1 --iodepth_batch_submit=128 --iodepth_batch_complete_max=128
...
  iops        : min=200338, max=200944, avg=200570.37, stdev=60.93, samples=120
...
--

We see that we reach the full 200k IOPS. Now compare to running inside the cgroup:

--
cgexec -g cpu:kubepods -- taskset -c 11 fio --name=write_iops --directory=/mnt/xfs --size=10G --time_based --runtime=1m \
    --ramp_time=10s --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=128 --rw=randwrite --group_reporting=1 --iodepth_batch_submit=128 --iodepth_batch_complete_max=128
...
  iops        : min=113589, max=120554, avg=116073.72, stdev=1334.46, samples=120
...
--

As you can see, we have lost almost 50% of our IOPS. Comparing cpu time and context switches, we see that in the bad case we are context switching a lot more. Overall the core is running at 100% in the bad case while only at something like 50% in the good case.

no cgroup: task clock of fio:

--
perf stat -e task-clock -p 27393 -- sleep 1

 Performance counter stats for process id '27393':

            442.62 msec task-clock            #    0.442 CPUs utilized

       1.002110208 seconds time elapsed
--

no cgroup: context switches on that core:

--
perf stat -e context-switches -C 11 -- sleep 1

 Performance counter stats for 'CPU(s) 11':

            103001      context-switches

       1.001048841 seconds time elapsed
--

Using the cgroup: task clock of fio:

--
perf stat -e task-clock -p 27456 -- sleep 1

 Performance counter stats for process id '27456':

            695.30 msec task-clock            #    0.695 CPUs utilized

       1.001112431 seconds time elapsed
--

Using the cgroup: context switches on that core:

--
perf stat -e context-switches -C 11 -- sleep 1

 Performance counter stats for 'CPU(s) 11':

            243755      context-switches

       1.001096517 seconds time elapsed
--

So we are doing about 2.5x more context switches in the bad case. Doing the math at ~120k IOPS, we see that for every IO we are doing roughly two context switches (in and out).
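As an aside, for environments without the libcgroup tools, the same cgroup can be created by writing to cgroupfs directly. A minimal sketch (assuming cgroup v2 mounted at /sys/fs/cgroup, run as root):

--
# enable the cpu controller for children of the root cgroup, if not already enabled
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control

# create the cgroup and give it 10x the default weight (the cpu.weight default is 100)
mkdir -p /sys/fs/cgroup/kubepods
echo 1000 > /sys/fs/cgroup/kubepods/cpu.weight

# move the current shell into the cgroup; anything started from it (e.g. fio) inherits it
echo $$ > /sys/fs/cgroup/kubepods/cgroup.procs
--

Running the fio command from that shell should be equivalent to the cgexec invocation above.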
Finally, we can also look at some perf sched traces to get an idea of what is happening (sched_stat_runtime events omitted). The general pattern in the good case seems to be:

--
fio 28143 [011] 2038.648954: sched:sched_waking: comm=kworker/11:68 pid=27489 prio=120 target_cpu=011
        ffffffff9f0d7ba3 try_to_wake_up+0x2b3 ([kernel.kallsyms])
        ffffffff9f0d7ba3 try_to_wake_up+0x2b3 ([kernel.kallsyms])
        ffffffff9f0b91d5 __queue_work+0x1d5 ([kernel.kallsyms])
        ffffffff9f0b93a4 queue_work_on+0x24 ([kernel.kallsyms])
        ffffffff9f3bb04c iomap_dio_bio_end_io+0x8c ([kernel.kallsyms])
        ffffffff9f53749d blk_mq_end_request_batch+0xfd ([kernel.kallsyms])
        ffffffff9f7198df nvme_irq+0x7f ([kernel.kallsyms])
        ffffffff9f113956 __handle_irq_event_percpu+0x46 ([kernel.kallsyms])
        ffffffff9f113b14 handle_irq_event+0x34 ([kernel.kallsyms])
        ffffffff9f118257 handle_edge_irq+0x87 ([kernel.kallsyms])
        ffffffff9f033eee __common_interrupt+0x3e ([kernel.kallsyms])
        ffffffff9fa023ab common_interrupt+0x7b ([kernel.kallsyms])
        ffffffff9fc00da2 asm_common_interrupt+0x22 ([kernel.kallsyms])
        ffffffff9f297a4b internal_get_user_pages_fast+0x10b ([kernel.kallsyms])
        ffffffff9f591bdb __iov_iter_get_pages_alloc+0xdb ([kernel.kallsyms])
        ffffffff9f591ef9 iov_iter_get_pages2+0x19 ([kernel.kallsyms])
        ffffffff9f5269af __bio_iov_iter_get_pages+0x5f ([kernel.kallsyms])
        ffffffff9f526d6d bio_iov_iter_get_pages+0x1d ([kernel.kallsyms])
        ffffffff9f3ba578 iomap_dio_bio_iter+0x288 ([kernel.kallsyms])
        ffffffff9f3bab72 __iomap_dio_rw+0x3e2 ([kernel.kallsyms])
        ffffffff9f3baf8e iomap_dio_rw+0xe ([kernel.kallsyms])
        ffffffff9f45ff58 xfs_file_dio_write_aligned+0x98 ([kernel.kallsyms])
        ffffffff9f460644 xfs_file_write_iter+0xc4 ([kernel.kallsyms])
        ffffffff9f39c876 aio_write+0x116 ([kernel.kallsyms])
        ffffffff9f3a034e io_submit_one+0xde ([kernel.kallsyms])
        ffffffff9f3a0960 __x64_sys_io_submit+0x80 ([kernel.kallsyms])
        ffffffff9fa01135 do_syscall_64+0x35 ([kernel.kallsyms])
        ffffffff9fc00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
                   3ee5d syscall+0x1d (/usr/lib64/libc.so.6)
              2500000025 [unknown] ([unknown])

fio 28143 [011] 2038.648974: sched:sched_switch: prev_comm=fio prev_pid=28143 prev_prio=120 prev_state=R ==> next_comm=kworker/11:68 next_pid=27489 next_prio=120
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d3aa schedule+0x5a ([kernel.kallsyms])
        ffffffff9f135a36 exit_to_user_mode_prepare+0xa6 ([kernel.kallsyms])
        ffffffff9fa050fd syscall_exit_to_user_mode+0x1d ([kernel.kallsyms])
        ffffffff9fa01142 do_syscall_64+0x42 ([kernel.kallsyms])
        ffffffff9fc00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
                   3ee5d syscall+0x1d (/usr/lib64/libc.so.6)
              2500000025 [unknown] ([unknown])

kworker/11:68-d 27489 [011] 2038.648984: sched:sched_switch: prev_comm=kworker/11:68 prev_pid=27489 prev_prio=120 prev_state=I ==> next_comm=fio next_pid=28143 next_prio=120
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d3aa schedule+0x5a ([kernel.kallsyms])
        ffffffff9f0ba249 worker_thread+0xb9 ([kernel.kallsyms])
        ffffffff9f0c1559 kthread+0xd9 ([kernel.kallsyms])
        ffffffff9f001e02 ret_from_fork+0x22 ([kernel.kallsyms])
--

fio is busy submitting aio events and gets interrupted by the nvme interrupt, at which point control is yielded to the dio worker, which handles the completion and yields back to fio.
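For reference, traces like the above can be captured with something along these lines (a sketch of one way to do it, not necessarily the exact invocation we used; needs root):

--
# record scheduler waking/switch events with kernel callchains on core 11 for a few seconds
perf record -e sched:sched_switch -e sched:sched_waking -e sched:sched_stat_runtime \
    -g -C 11 -- sleep 5

# dump the recorded events together with their stacks
perf script
--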
Looking at the bad case, there now seems to be some form of ping-pong:

--
fio 28517 [011] 2702.018634: sched:sched_switch: prev_comm=fio prev_pid=28517 prev_prio=120 prev_state=S ==> next_comm=kworker/11:68 next_pid=27489 next_prio=120
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d3aa schedule+0x5a ([kernel.kallsyms])
        ffffffff9f39de89 read_events+0x119 ([kernel.kallsyms])
        ffffffff9f39e042 do_io_getevents+0x72 ([kernel.kallsyms])
        ffffffff9f39e689 __x64_sys_io_getevents+0x59 ([kernel.kallsyms])
        ffffffff9fa01135 do_syscall_64+0x35 ([kernel.kallsyms])
        ffffffff9fc00126 entry_SYSCALL_64_after_hwframe+0x6e ([kernel.kallsyms])
                   3ee5d syscall+0x1d (/usr/lib64/libc.so.6)
             11300000113 [unknown] ([unknown])

kworker/11:68+d 27489 [011] 2702.018639: sched:sched_waking: comm=fio pid=28517 prio=120 target_cpu=011
        ffffffff9f0d7ba3 try_to_wake_up+0x2b3 ([kernel.kallsyms])
        ffffffff9f0d7ba3 try_to_wake_up+0x2b3 ([kernel.kallsyms])
        ffffffff9f0fa9d1 autoremove_wake_function+0x11 ([kernel.kallsyms])
        ffffffff9f0fbb90 __wake_up_common+0x80 ([kernel.kallsyms])
        ffffffff9f0fbd23 __wake_up_common_lock+0x83 ([kernel.kallsyms])
        ffffffff9f39f9df aio_complete_rw+0xef ([kernel.kallsyms])
        ffffffff9f0b9c35 process_one_work+0x1e5 ([kernel.kallsyms])
        ffffffff9f0ba1e0 worker_thread+0x50 ([kernel.kallsyms])
        ffffffff9f0c1559 kthread+0xd9 ([kernel.kallsyms])
        ffffffff9f001e02 ret_from_fork+0x22 ([kernel.kallsyms])

kworker/11:68+d 27489 [011] 2702.018642: sched:sched_switch: prev_comm=kworker/11:68 prev_pid=27489 prev_prio=120 prev_state=R+ ==> next_comm=fio next_pid=28517 next_prio=120
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d002 __schedule+0x282 ([kernel.kallsyms])
        ffffffff9fa0d4cb preempt_schedule_common+0x1b ([kernel.kallsyms])
        ffffffff9fa0d51c __cond_resched+0x1c ([kernel.kallsyms])
        ffffffff9f0b9c56 process_one_work+0x206 ([kernel.kallsyms])
        ffffffff9f0ba1e0 worker_thread+0x50 ([kernel.kallsyms])
        ffffffff9f0c1559 kthread+0xd9 ([kernel.kallsyms])
        ffffffff9f001e02 ret_from_fork+0x22 ([kernel.kallsyms])
--

fio is sleeping in io_getevents, waiting for all events to complete. The dio worker thread gets scheduled in, handling the aio completions one by one. This wakes fio, as there are now some completions ready for it to process. Now, because of the high weight of the fio process, the kernel worker only gets a short amount of runtime before the scheduler preempts it and yields back to fio (notice the __cond_resched stack and the R+ state in the above trace). However, because fio is waiting for all aios to complete, it wakes up and goes straight back to sleep again. This ping-pong continues.

Note that because we run with `--iodepth_batch_submit=128 --iodepth_batch_complete_max=128`, io_getevents won't actually return to userspace until all 128 events have completed. However, even if we were to return after just one event (i.e. calling io_getevents with min_nr=1), that might just make things worse, as the resulting application-level spin loop around io_getevents is likely even more expensive.

The issue can be avoided by renicing the dio kernel worker thread to a higher priority (see the rough sketch further down). However, from my understanding there is no way to do this reliably, as these worker threads are ephemeral?

I am wondering whether you have any thoughts on the above and/or can think of any workarounds that we could apply?

Note that I also tested this on Ubuntu 24.04 / Linux 6.8, which seems to behave a lot better: I am only seeing a 1.2-1.3x increase in context switches in the bad case.
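For illustration, the renice workaround looks roughly like this (a sketch only, not a reliable fix: it renices every kworker currently bound to core 11, not just the dio completion worker, and kworkers spawned later are not covered, which is exactly the ephemerality problem mentioned above; needs root):

--
# bump the priority of the kworkers currently pinned to core 11
for pid in $(pgrep '^kworker/11:'); do
    renice -n -19 -p "$pid"
done
--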
I assume the new EEVDF scheduler handles this scenario better? So that might be a solution, but it will obviously take a while until that kernel version is more widespread in production deployments. Possibly commit 71eb6b6b0ba93b1467bccff57b5de746b09113d2 ("fs/aio: obey min_nr when doing wakeups") is also helping here, but as described above, avoiding the wakeup when min_nr isn't reached doesn't help in the more realistic scenario where we are polling with min_nr=1 anyway.

Thanks,
Stephan