On 8/10/21 10:37 AM, Jens Axboe wrote:
> Hi,
>
> This is v3 of this patchset. We're back to passing the cache pointer
> in the kiocb, I do think that's the cleanest and it's also the most
> efficient approach. A patch has been added to remove a member from
> the io_uring req_rw structure, so that the kiocb size bump doesn't
> result in the per-command part of io_kiocb bumping into the next
> cacheline.
>
> Another benefit of this approach is that we get per-ring caching.
> That means if an application splits polled IO into two threads, one
> doing submit and one doing reaps, then we still get the full benefit
> of the bio caching.
>
> The tl;dr here is that we get about a 10% bump in polled performance with
> this patchset, as we can recycle bio structures essentially for free.
> Outside of that, explanations are in each patch. I've also got an iomap patch,
> but I'm trying to keep this to a single user until there's agreement on the
> direction.
>
> Against for-5.15/io_uring, and can also be found in my
> io_uring-bio-cache.3 branch.

As a reference, before the patch:

axboe@amd ~/g/fio (master)> sudo taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 /dev/nvme3n1
i 8, argc 9
Added file /dev/nvme3n1 (submitter 0)
sq_ring ptr = 0x0x7f63a7c42000
sqes ptr    = 0x0x7f63a7c40000
cq_ring ptr = 0x0x7f63a7c3e000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=1600
IOPS=3111520, IOS/call=32/31, inflight=128 (128)

or around 3.1M IOPS (single thread, single core), and after:

axboe@amd ~/g/fio (master)> sudo taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 /dev/nvme3n1
i 8, argc 9
Added file /dev/nvme3n1 (submitter 0)
sq_ring ptr = 0x0x7f62726bc000
sqes ptr    = 0x0x7f62726ba000
cq_ring ptr = 0x0x7f62726b8000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=1791
IOPS=3417120, IOS/call=32/31, inflight=128 (128)

which is about a ~10% increase in per-core IOPS for this kind of workload.

-- 
Jens Axboe
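
[Editor's illustration, not part of the series: a minimal userspace sketch of the recycling idea described above. The per-ring cache amounts to a free list that the completion path feeds and the submission path pops from, so steady-state polled IO avoids the general allocator. The obj_cache, cache_get, and cache_put names below are invented for this sketch and do not correspond to the kernel APIs in the patchset.]

#include <stdlib.h>
#include <stdio.h>

struct obj {
	struct obj *next;   /* free-list linkage while the object sits in the cache */
	char payload[128];  /* stand-in for the real structure being recycled */
};

struct obj_cache {
	struct obj *free;   /* singly linked list of recycled objects */
	unsigned long hits, misses;
};

/* Allocation path: pop from the cache if possible, else fall back to malloc(). */
static struct obj *cache_get(struct obj_cache *c)
{
	struct obj *o = c->free;

	if (o) {
		c->free = o->next;
		c->hits++;
		return o;
	}
	c->misses++;
	return malloc(sizeof(*o));
}

/* Completion path: push the object back onto the free list instead of freeing it. */
static void cache_put(struct obj_cache *c, struct obj *o)
{
	o->next = c->free;
	c->free = o;
}

int main(void)
{
	struct obj_cache cache = { 0 };

	/* Simulate submit/complete cycles; after the first miss, every get is a hit. */
	for (int i = 0; i < 1000; i++) {
		struct obj *o = cache_get(&cache);
		cache_put(&cache, o);
	}
	printf("hits=%lu misses=%lu\n", cache.hits, cache.misses);

	/* Drain the free list on teardown. */
	while (cache.free) {
		struct obj *o = cache.free;

		cache.free = o->next;
		free(o);
	}
	return 0;
}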