On 1/3/23 03:03, Pavel Begunkov wrote:
The series replaces waitqueues for CQ waiting with a custom waiting loop and adds a couple more perf tweaks around it. Benchmarking was done for QD1 with simulated tw arrival right after we start waiting; it takes us from 7.5 MIOPS to 9.2 MIOPS, which is +22%, or double that number if counting only the in-kernel io_uring overhead (i.e. excluding syscall and userspace). That matches the profiles: wake_up() _without_ wake_up_state() was taking 12-14%, and prepare_to_wait_exclusive() was around 4-6%.
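To make the mechanics concrete, here is a minimal C sketch of the shape of such a change, not the actual patches: the waitqueue path pays for the waitqueue lock and entry walk in both prepare_to_wait_exclusive() and wake_up(), while a custom loop can publish the waiter's task_struct and have the completion side call wake_up_state() on it directly. The cq_ctx, cq_waiter and cq_has_events() names below are made up for illustration, a single waiter is assumed, and the real code also has to handle timeouts and more.

#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

struct cq_ctx {
	wait_queue_head_t cq_wait;
	struct task_struct *cq_waiter;	/* hypothetical single-waiter slot */
};

/* hypothetical predicate: true when enough CQEs have been posted */
static bool cq_has_events(struct cq_ctx *ctx);

/* old: generic waitqueue, the waker has to lock and walk cq_wait */
static void cq_wait_old(struct cq_ctx *ctx)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&ctx->cq_wait, &wait,
					  TASK_INTERRUPTIBLE);
		if (cq_has_events(ctx) || signal_pending(current))
			break;
		schedule();
	}
	finish_wait(&ctx->cq_wait, &wait);
}

/* new: custom loop, no waitqueue entry, just a published task pointer */
static void cq_wait_new(struct cq_ctx *ctx)
{
	WRITE_ONCE(ctx->cq_waiter, current);
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (cq_has_events(ctx) || signal_pending(current))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	WRITE_ONCE(ctx->cq_waiter, NULL);
}

/* completion side: wake the published task directly, no waitqueue walk */
static void cq_wake(struct cq_ctx *ctx)
{
	struct task_struct *tsk = READ_ONCE(ctx->cq_waiter);

	if (tsk)
		wake_up_state(tsk, TASK_INTERRUPTIBLE);
}

The win in this sketch is on the wake side: cq_wake() goes straight to wake_up_state() on the published task instead of taking the waitqueue spinlock and iterating wait entries, which is where the 12-14% in the profile above was going.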
The numbers are gathered with an in-kernel trick. Tried to quickly measure without it:

modprobe null_blk no_sched=1 irqmode=2 completion_nsec=0
taskset -c 0 fio/t/io_uring -d1 -s1 -c1 -p0 -B1 -F1 -X -b512 -n4 /dev/nullb0

The important part here is using timer-backed null_blk and pinning multiple workers to a single CPU; -n4 was enough for me to keep the CPU 100% busy.

old:
IOPS=539.51K, BW=2.11GiB/s, IOS/call=1/1
IOPS=542.26K, BW=2.12GiB/s, IOS/call=1/1
IOPS=540.73K, BW=2.11GiB/s, IOS/call=1/1
IOPS=541.28K, BW=2.11GiB/s, IOS/call=0/0

new:
IOPS=561.85K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.58K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.56K, BW=2.19GiB/s, IOS/call=1/1
IOPS=559.94K, BW=2.19GiB/s, IOS/call=1/1

The difference is only ~3.5% because of the huge additional overhead from the null_blk timers, block QoS and other unnecessary bits.

P.S. Tested with an out-of-tree patch adding a flag to enable/disable the feature, to remove variance between reboots.

--
Pavel Begunkov