Jens tried out a similar series with some not-yet-sent additions: 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%. 12/16 is bulky, but it nicely drives the numbers. Moreover, with it we can get rid of some optimisations in __blkdev_direct_IO() that are no longer needed, because it always serves multiple bios: e.g. there is no need for the conditional referencing with DIO_MULTI_BIO, and it can _probably_ be converted to chained bios.
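For illustration, the "conditional referencing" pattern in question looks
roughly like the sketch below. This is a self-contained userspace model,
not the actual fs/block_dev.c code; all names and types here are stand-ins.

/*
 * Illustrative sketch of the DIO_MULTI_BIO "conditional referencing"
 * pattern: the completion refcount is only maintained when a request
 * spans multiple bios, so every completion pays for a flag test.
 */
#include <stdatomic.h>
#include <stdio.h>

#define DIO_MULTI_BIO	(1u << 0)

struct blkdev_dio {
	unsigned int flags;
	atomic_int ref;		/* completion refcount, multi-bio only */
};

/* Current scheme: branch on the flag before touching the refcount. */
static void bio_endio_conditional(struct blkdev_dio *dio)
{
	if (!(dio->flags & DIO_MULTI_BIO))
		printf("dio done (single-bio fast path)\n");
	else if (atomic_fetch_sub(&dio->ref, 1) == 1)
		printf("dio done (last of several bios)\n");
}

/*
 * If __blkdev_direct_IO() always serves multiple bios, the flag test can
 * go away: always take/drop the reference, or chain the bios so that only
 * the final completion fires the dio's end_io.
 */
static void bio_endio_unconditional(struct blkdev_dio *dio)
{
	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		printf("dio done\n");
}

int main(void)
{
	struct blkdev_dio dio = { .flags = DIO_MULTI_BIO };

	atomic_store(&dio.ref, 2);	/* two in-flight bios */
	bio_endio_conditional(&dio);	/* first completion: ref 2 -> 1 */
	bio_endio_conditional(&dio);	/* second completion fires end_io */

	atomic_store(&dio.ref, 2);
	bio_endio_unconditional(&dio);
	bio_endio_unconditional(&dio);
	return 0;
}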
Some numbers. Using null_blk is not perfect, but empirically, judging by the
numbers Jens posts, his Optane setup usually gives somewhat relatable results
in terms of % difference (in the worst case, probably halve the percentage
difference).

modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 submit_queues=16 poll_queues=32
echo 0 > /sys/block/nullb0/queue/iostats
echo 2 > /sys/block/nullb0/queue/nomerges
nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p1 -B1 -F1 -b512 /dev/nullb0
# polled=1, fixedbufs=1, register_files=1, buffered=0 QD=32, sq_ring=32, cq_ring=64

# baseline (for-5.16/block)
IOPS=4304768, IOS/call=32/32, inflight=32 (32)
IOPS=4289824, IOS/call=32/32, inflight=32 (32)
IOPS=4227808, IOS/call=32/32, inflight=32 (32)
IOPS=4187008, IOS/call=32/32, inflight=32 (32)
IOPS=4196992, IOS/call=32/32, inflight=32 (32)
IOPS=4208384, IOS/call=32/32, inflight=32 (32)
IOPS=4233888, IOS/call=32/32, inflight=32 (32)
IOPS=4266432, IOS/call=32/32, inflight=32 (32)
IOPS=4232352, IOS/call=32/32, inflight=32 (32)

# + patch 14/16 (skip advance)
IOPS=4367424, IOS/call=32/32, inflight=0 (16)
IOPS=4401088, IOS/call=32/32, inflight=32 (32)
IOPS=4400544, IOS/call=32/32, inflight=0 (29)
IOPS=4400768, IOS/call=32/32, inflight=32 (32)
IOPS=4409568, IOS/call=32/32, inflight=32 (32)
IOPS=4373888, IOS/call=32/32, inflight=32 (32)
IOPS=4392544, IOS/call=32/32, inflight=32 (32)
IOPS=4368192, IOS/call=32/32, inflight=32 (32)
IOPS=4362976, IOS/call=32/32, inflight=32 (32)

Comparing profiles. Before:
+    1.75%  io_uring  [kernel.vmlinux]  [k] bio_iov_iter_get_pages
+    0.90%  io_uring  [kernel.vmlinux]  [k] iov_iter_advance

After:
+    0.91%  io_uring  [kernel.vmlinux]  [k] bio_iov_iter_get_pages_hint
[no iov_iter_advance]

# + patches 15,16 (switch optimisation)
IOPS=4485984, IOS/call=32/32, inflight=32 (32)
IOPS=4500384, IOS/call=32/32, inflight=32 (32)
IOPS=4524512, IOS/call=32/32, inflight=32 (32)
IOPS=4507424, IOS/call=32/32, inflight=32 (32)
IOPS=4497216, IOS/call=32/32, inflight=32 (32)
IOPS=4496832, IOS/call=32/32, inflight=32 (32)
IOPS=4505632, IOS/call=32/32, inflight=32 (32)
IOPS=4476224, IOS/call=32/32, inflight=32 (32)
IOPS=4478592, IOS/call=32/32, inflight=32 (32)
IOPS=4480128, IOS/call=32/32, inflight=32 (32)
IOPS=4468640, IOS/call=32/32, inflight=32 (32)

Before:
+    1.92%  io_uring  [kernel.vmlinux]  [k] submit_bio_checks
+    5.56%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

After:
+    1.66%  io_uring  [kernel.vmlinux]  [k] submit_bio_checks
+    5.49%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

That's a 0.3% difference in the profiles (7.48% vs 7.15% combined), while the
absolute numbers show ~2%. The gap is most probably just a coincidence; 0.3%
looks realistic.

--
Pavel Begunkov