Jens tried out a similar series with some not-yet-sent additions: 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%. 12/16 is bulky, but it nicely drives the numbers. Moreover, with it we can get rid of some optimisations in __blkdev_direct_IO() that are no longer needed, because it always serves multiple bios: e.g. there is no need for the conditional referencing with DIO_MULTI_BIO, and it can _probably_ be converted to chained bios.
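For illustration, the "conditional referencing" pattern in question looks
roughly like the sketch below. This is a self-contained userspace model,
not the actual fs/block_dev.c code; all names and types here are stand-ins.

/*
 * Illustrative sketch of the DIO_MULTI_BIO "conditional referencing"
 * pattern: the completion refcount is only maintained when a request
 * spans multiple bios, so every completion pays for a flag test.
 */
#include <stdatomic.h>
#include <stdio.h>

#define DIO_MULTI_BIO	(1u << 0)

struct blkdev_dio {
	unsigned int flags;
	atomic_int ref;		/* completion refcount, multi-bio only */
};

/* Current scheme: branch on the flag before touching the refcount. */
static void bio_endio_conditional(struct blkdev_dio *dio)
{
	if (!(dio->flags & DIO_MULTI_BIO))
		printf("dio done (single-bio fast path)\n");
	else if (atomic_fetch_sub(&dio->ref, 1) == 1)
		printf("dio done (last of several bios)\n");
}

/*
 * If __blkdev_direct_IO() always serves multiple bios, the flag test can
 * go away: always take/drop the reference, or chain the bios so that only
 * the final completion fires the dio's end_io.
 */
static void bio_endio_unconditional(struct blkdev_dio *dio)
{
	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		printf("dio done\n");
}

int main(void)
{
	struct blkdev_dio dio = { .flags = DIO_MULTI_BIO };

	atomic_store(&dio.ref, 2);	/* two in-flight bios */
	bio_endio_conditional(&dio);	/* first completion: ref 2 -> 1 */
	bio_endio_conditional(&dio);	/* second completion fires end_io */

	atomic_store(&dio.ref, 2);
	bio_endio_unconditional(&dio);
	bio_endio_unconditional(&dio);
	return 0;
}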
Some numbers. Using null_blk is not perfect, but empirically, judging by the
numbers Jens posts, his Optane setup usually gives somewhat relatable results
in terms of % difference (in the worst case, probably halve the percentage
difference).

modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 submit_queues=16 poll_queues=32
echo 0 > /sys/block/nullb0/queue/iostats
echo 2 > /sys/block/nullb0/queue/nomerges
nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p1 -B1 -F1 -b512 /dev/nullb0
# polled=1, fixedbufs=1, register_files=1, buffered=0 QD=32, sq_ring=32, cq_ring=64

# baseline (for-5.16/block)
IOPS=4304768, IOS/call=32/32, inflight=32 (32)
IOPS=4289824, IOS/call=32/32, inflight=32 (32)
IOPS=4227808, IOS/call=32/32, inflight=32 (32)
IOPS=4187008, IOS/call=32/32, inflight=32 (32)
IOPS=4196992, IOS/call=32/32, inflight=32 (32)
IOPS=4208384, IOS/call=32/32, inflight=32 (32)
IOPS=4233888, IOS/call=32/32, inflight=32 (32)
IOPS=4266432, IOS/call=32/32, inflight=32 (32)
IOPS=4232352, IOS/call=32/32, inflight=32 (32)

# + patch 14/16 (skip advance)
IOPS=4367424, IOS/call=32/32, inflight=0 (16)
IOPS=4401088, IOS/call=32/32, inflight=32 (32)
IOPS=4400544, IOS/call=32/32, inflight=0 (29)
IOPS=4400768, IOS/call=32/32, inflight=32 (32)
IOPS=4409568, IOS/call=32/32, inflight=32 (32)
IOPS=4373888, IOS/call=32/32, inflight=32 (32)
IOPS=4392544, IOS/call=32/32, inflight=32 (32)
IOPS=4368192, IOS/call=32/32, inflight=32 (32)
IOPS=4362976, IOS/call=32/32, inflight=32 (32)

Comparing profiles. Before:
+    1.75%  io_uring  [kernel.vmlinux]  [k] bio_iov_iter_get_pages
+    0.90%  io_uring  [kernel.vmlinux]  [k] iov_iter_advance

After:
+    0.91%  io_uring  [kernel.vmlinux]  [k] bio_iov_iter_get_pages_hint
[no iov_iter_advance]

# + patches 15,16 (switch optimisation)
IOPS=4485984, IOS/call=32/32, inflight=32 (32)
IOPS=4500384, IOS/call=32/32, inflight=32 (32)
IOPS=4524512, IOS/call=32/32, inflight=32 (32)
IOPS=4507424, IOS/call=32/32, inflight=32 (32)
IOPS=4497216, IOS/call=32/32, inflight=32 (32)
IOPS=4496832, IOS/call=32/32, inflight=32 (32)
IOPS=4505632, IOS/call=32/32, inflight=32 (32)
IOPS=4476224, IOS/call=32/32, inflight=32 (32)
IOPS=4478592, IOS/call=32/32, inflight=32 (32)
IOPS=4480128, IOS/call=32/32, inflight=32 (32)
IOPS=4468640, IOS/call=32/32, inflight=32 (32)

Before:
+    1.92%  io_uring  [kernel.vmlinux]  [k] submit_bio_checks
+    5.56%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

After:
+    1.66%  io_uring  [kernel.vmlinux]  [k] submit_bio_checks
+    5.49%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

That's a 0.3% difference in the profiles (7.48% vs 7.15% combined), while the
absolute numbers show ~2%. The gap is most probably just a coincidence; 0.3%
looks realistic.

--
Pavel Begunkov