On Thu, Aug 11, 2022 at 09:22:32AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 10, 2022 at 12:05:05PM -0600, Keith Busch wrote:
> > The functions are implemented under 'include/linux/', indistinguishable from
> > exported APIs. I think I understand why they are there, but they look the same
> > as exported functions from a driver perspective.
>
> swiotlb.h is not a driver API. There's two leftovers used by the drm
> code I'm trying to get fixed up, but in general the DMA API is the
> interface and swiotlb is just an implementation detail.
>
> > Perhaps I'm being daft, but I'm totally missing why I should care if swiotlb
> > leverages this feature. If you're using that, you've traded performance for
> > security or compatibility already. If this idea can be used to make it perform
> > better, then great, but that shouldn't be the reason to hold this up IMO.
>
> We firstly need to make sure that everything actually works on swiotlb, or
> any other implementation that properly implements the DMA API.
>
> And the fact that I/O performance currently sucks and we can fix it on
> the trusted hypervisor is an important consideration. At least as
> important as micro-optimizing performance a little more on setups
> not using them. So not taking care of both in one go seems rather silly
> for a feature that is in its current form pretty intrusive and thus needs
> a really good justification.

Sorry for the delayed response; I had some trouble with my test setup.

Okay, I will restart developing this with swiotlb in mind.

In the meantime, I wanted to share some results with this series because
I'm thinking this might be past the threshold for when we can drop the
"micro-" prefix on these optimisations. The most significant data points
are these:

 * submission latency stays the same regardless of the transfer size or
   queue depth
 * IOPs are always equal or better (usually better), with up to 50%
   reduced CPU cost

Based on this, I do think this type of optimisation is worth having
something like a new bio type. I know this introduces some complications
in the io-path, but it is pretty minimal and doesn't add any size
penalties to common structs for drivers that don't use them.

Test details:

  fio with ioengine=io_uring
    'none': using __user void*
    'bvec': using buf registered with IORING_REGISTER_BUFFERS
    'dma':  using buf registered with IORING_REGISTER_MAP_BUFFERS (new)

  intel_iommu=on

Results: (submission latency [slat] in nanoseconds)

Q-Depth 1:

  Size | Premap |  IOPs  |  slat  | sys-cpu%
 ......|........|........|........|.........
   4k  |  none  |  41.4k |  2126  |  16.47%
       |  bvec  |  43.8k |  1843  |  15.79%
       |  dma   |  46.8k |  1504  |  14.94%
  16k  |  none  |  33.3k |  3279  |  17.78%
       |  bvec  |  33.9k |  2607  |  14.59%
       |  dma   |  40.2k |  1490  |  12.57%
  64k  |  none  |  18.7k |  6778  |  18.22%
       |  bvec  |  20.0k |  4626  |  13.80%
       |  dma   |  22.6k |  1586  |   7.58%

Q-Depth 16:

  Size | Premap |  IOPs  |  slat  | sys-cpu%
 ......|........|........|........|.........
   4k  |  none  |  207k  |  3657  |  72.81%
       |  bvec  |  219k  |  3369  |  71.55%
       |  dma   |  310k  |  2237  |  60.16%
  16k  |  none  |  164k  |  5024  |  78.38%
       |  bvec  |  177k  |  4553  |  76.29%
       |  dma   |  186k  |  1880  |  43.56%
  64k  |  none  |  46.7k |  4424  |  30.51%
       |  bvec  |  46.7k |  4389  |  29.42%
       |  dma   |  46.7k |  1574  |  15.61%
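As an aside, for anyone wanting to poke at the 'bvec' baseline outside of
fio, below is a minimal, illustrative sketch (not taken from fio) of what
that case does in userspace: register a fixed buffer with
IORING_REGISTER_BUFFERS through liburing, then issue a fixed-buffer read
against it. The 'dma' case would additionally premap the same registered
buffer against the target device via the new IORING_REGISTER_MAP_BUFFERS
opcode from this series; I'm not reproducing that interface here since it
only exists in these patches. The registration step amortises the page
pinning across all subsequent I/O, which is presumably why slat stays flat
as the transfer size grows; build with -luring.

/* fixed_read.c: register one buffer and read into it with read_fixed */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>

#define BUF_SIZE	(64 * 1024)

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device or file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT wants an aligned buffer */
	if (posix_memalign(&buf, 4096, BUF_SIZE))
		return 1;

	if (io_uring_queue_init(16, &ring, 0))
		return 1;

	/* One-time setup: pin and register the buffer (the 'bvec' premap) */
	iov.iov_base = buf;
	iov.iov_len = BUF_SIZE;
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (ret) {
		fprintf(stderr, "register_buffers: %s\n", strerror(-ret));
		return 1;
	}

	/* Per-IO path: reference registered buffer index 0, no per-IO pin */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, fd, buf, BUF_SIZE, 0, 0);
	io_uring_submit(&ring);

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0 || cqe->res < 0) {
		fprintf(stderr, "read failed\n");
		return 1;
	}
	printf("read %d bytes\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}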