On 1/3/23 03:03, Pavel Begunkov wrote:
The series replaces waitqueues for CQ waiting with a custom waiting loop and adds a couple more perf tweaks around it. Benchmarking was done for QD1 with simulated tw arrival right after we start waiting; it takes us from 7.5 MIOPS to 9.2 MIOPS, which is +22%, or double that number if counting only the in-kernel io_uring overhead (i.e. excluding syscall and userspace). That matches the profiles: wake_up() _without_ wake_up_state() was taking 12-14%, and prepare_to_wait_exclusive() was around 4-6%.
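To make the mechanics concrete, here is a minimal C sketch of the shape of such a change, not the actual patches: the waitqueue path pays for the waitqueue lock and entry walk in both prepare_to_wait_exclusive() and wake_up(), while a custom loop can publish the waiter's task_struct and have the completion side call wake_up_state() on it directly. The cq_ctx, cq_waiter and cq_has_events() names below are made up for illustration, a single waiter is assumed, and the real code also has to handle timeouts and more.

#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

struct cq_ctx {
	wait_queue_head_t cq_wait;
	struct task_struct *cq_waiter;	/* hypothetical single-waiter slot */
};

/* hypothetical predicate: true when enough CQEs have been posted */
static bool cq_has_events(struct cq_ctx *ctx);

/* old: generic waitqueue, the waker has to lock and walk cq_wait */
static void cq_wait_old(struct cq_ctx *ctx)
{
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait_exclusive(&ctx->cq_wait, &wait,
					  TASK_INTERRUPTIBLE);
		if (cq_has_events(ctx) || signal_pending(current))
			break;
		schedule();
	}
	finish_wait(&ctx->cq_wait, &wait);
}

/* new: custom loop, no waitqueue entry, just a published task pointer */
static void cq_wait_new(struct cq_ctx *ctx)
{
	WRITE_ONCE(ctx->cq_waiter, current);
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (cq_has_events(ctx) || signal_pending(current))
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);
	WRITE_ONCE(ctx->cq_waiter, NULL);
}

/* completion side: wake the published task directly, no waitqueue walk */
static void cq_wake(struct cq_ctx *ctx)
{
	struct task_struct *tsk = READ_ONCE(ctx->cq_waiter);

	if (tsk)
		wake_up_state(tsk, TASK_INTERRUPTIBLE);
}

The win in this sketch is on the wake side: cq_wake() goes straight to wake_up_state() on the published task instead of taking the waitqueue spinlock and iterating wait entries, which is where the 12-14% in the profile above was going.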
The numbers are gathered with an in-kernel trick. Tried to quickly measure without it:

modprobe null_blk no_sched=1 irqmode=2 completion_nsec=0
taskset -c 0 fio/t/io_uring -d1 -s1 -c1 -p0 -B1 -F1 -X -b512 -n4 /dev/nullb0

The important part here is using timer-backed null_blk and pinning multiple workers to a single CPU; -n4 was enough for me to keep the CPU 100% busy.

old:
IOPS=539.51K, BW=2.11GiB/s, IOS/call=1/1
IOPS=542.26K, BW=2.12GiB/s, IOS/call=1/1
IOPS=540.73K, BW=2.11GiB/s, IOS/call=1/1
IOPS=541.28K, BW=2.11GiB/s, IOS/call=0/0

new:
IOPS=561.85K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.58K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.56K, BW=2.19GiB/s, IOS/call=1/1
IOPS=559.94K, BW=2.19GiB/s, IOS/call=1/1

The difference is only ~3.5% because of the huge additional overhead from the null_blk timers, block QoS and other unnecessary bits.

P.S. Tested with an out-of-tree patch adding a flag to enable/disable the feature, to remove variance between reboots.

--
Pavel Begunkov