Re: switch block layer polling to a bio based model

On 4/26/21 7:48 AM, Christoph Hellwig wrote:
> Hi all,
> 
> This series cleans up the block polling code a bit and changes the interface
> to poll for a specific bio instead of a request_queue and cookie pair.
> 
> Polling for the bio itself leads to a few advantages:
> 
>   - the cookie construction can be made entirely private in blk-mq.c
>   - the caller does not need to remember the request_queue and cookie
>     separately and thus sidesteps their lifetime issues
>   - keeping the device and the cookie inside the bio makes it trivial to
>     support polling of bios remapped by stacking drivers
>   - a lot of code to propagate the cookie back up the submission path can
>     be removed entirely
> 
> The one major caveat is that this requires RCU freeing polled BIOs to make
> sure the bio that contains the polling information is still alive when
> io_uring tries to poll it through the iocb. For synchronous polling all the
> callers have a bio reference anyway, so this is not an issue.

Was curious about this separately, so ran a quick test on it. Running polled
IO on a fast device, performance drops about 10% with this applied. Outside
of that, we have ksoftirqd using 5-7% of CPU continually, just doing frees:

+   45.33%  ksoftirqd/0  [kernel.vmlinux]  [k] __slab_free
+   15.91%  ksoftirqd/0  [kernel.vmlinux]  [k] kmem_cache_free
+   12.66%  ksoftirqd/0  [kernel.vmlinux]  [k] rcu_cblist_dequeue
+    8.39%  ksoftirqd/0  [kernel.vmlinux]  [k] rcu_core
+    4.75%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
+    3.27%  ksoftirqd/0  [kernel.vmlinux]  [k] bio_free_rcu
+    1.98%  ksoftirqd/0  [kernel.vmlinux]  [k] mempool_free_slab

This all means that we go from 2.97M IOPS to 2.70M IOPS in that
particular test (QD=128, async polled).
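
As a quick sanity check on those two figures, the relative drop can be
computed directly. This is only an illustrative helper (the function name
is made up here, not something from the series); for 2.97M -> 2.70M it
works out to roughly 9.1%, consistent with the "about 10%" above.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative throughput drop in percent when going
 * from 'before' to 'after'. Units cancel, so millions of IOPS can be
 * passed in directly.
 */
static double relative_drop_percent(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```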

I was separately curious about this as I have an (as yet unposted)
patchset that recycles bio allocations, as we spend quite a bit of time
doing that for high rate polled IO. It's good for taking the above 2.97M
IOPS to 3.2-3.3M IOPS, and it'd obviously be a bit more problematic with
required RCU freeing of bios. Even without the alloc cache, using RCU
will ruin any potential cache locality on back-to-back bio free + bio
alloc.
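
To illustrate the locality point, here is a minimal user-space sketch. It
is a toy model only: the toy_* names are hypothetical, it is not the
kernel's bio code, not the unposted recycling patchset, and the "grace
period" is just a function call rather than real RCU. It shows that with
an immediate LIFO free, a back-to-back free + alloc hands back the same
(presumably still cache-hot) object, while with a deferred free the
allocator has to hand out a different object until the deferral ends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio; not the kernel structure. */
struct toy_obj {
	struct toy_obj *next;
};

struct toy_cache {
	struct toy_obj *free_list;	/* immediately reusable, LIFO order */
	struct toy_obj *deferred;	/* waiting for a simulated grace period */
};

/* Pop the most recently freed object, or NULL if none is available. */
static struct toy_obj *toy_alloc(struct toy_cache *c)
{
	struct toy_obj *o = c->free_list;

	if (o)
		c->free_list = o->next;
	return o;
}

/* Immediate free: the object becomes the next allocation candidate. */
static void toy_free_now(struct toy_cache *c, struct toy_obj *o)
{
	assert(o != NULL);
	o->next = c->free_list;
	c->free_list = o;
}

/* Deferred free: the object is parked and cannot be reused yet. */
static void toy_free_deferred(struct toy_cache *c, struct toy_obj *o)
{
	assert(o != NULL);
	o->next = c->deferred;
	c->deferred = o;
}

/* Simulated end of a grace period: parked objects become reusable. */
static void toy_grace_period_end(struct toy_cache *c)
{
	while (c->deferred) {
		struct toy_obj *o = c->deferred;

		c->deferred = o->next;
		toy_free_now(c, o);
	}
}
```

With toy_free_now(), freeing an object and then calling toy_alloc()
returns that same object. With toy_free_deferred(), the same sequence
returns some other object (or NULL) until toy_grace_period_end() runs.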

-- 
Jens Axboe



