On 2021-06-03 13:32, Jussi Maki wrote:
> On Wed, Jun 2, 2021 at 2:49 PM Robin Murphy <robin.murphy@xxxxxxx> wrote:
>>> Thanks for the quick response & patch. I tried it out and indeed it
>>> does solve the issue:
>>
>> Cool, thanks Jussi. May I infer a Tested-by tag from that?
>
> Of course!
>> Given that the race looks to have been pretty theoretical until now, I'm
>> not convinced it's worth the bother of digging through the long history
>> of default domain and DMA ops movement to figure out where it started,
>> much less attempt invasive backports. The flush queue change which made
>> it apparent only landed in 5.13-rc1, so as long as we can get this in as
>> a fix in the current cycle we should be golden - in the meantime, note
>> that booting with "iommu.strict=0" should also restore the expected
>> behaviour.
>>
>> FWIW I do still plan to resend the patch "properly" soon (in all honesty
>> it wasn't even compile-tested!)
> BTW, even with the patch there's quite a bit of spin lock contention
> coming from ice_xmit_xdp_ring->dma_map_page_attrs->...->alloc_iova.
> CPU load drops from 85% to 20% (~80Mpps, 64b UDP) when iommu is
> disabled. Is this type of overhead to be expected?
Yes, IOVA allocation can still be a bottleneck - the percpu caching
system mostly alleviates it, but certain workloads can still defeat
that, and if you're spending significant time in alloc_iova() rather
than alloc_iova_fast() then it sounds like yours is one of them.
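To make the fast/slow split concrete, the allocation path looks roughly
like the sketch below - paraphrased from memory of drivers/iommu/iova.c
around v5.13, so don't treat it as verbatim. The per-CPU rcache lookup
is the cheap part; it's only on a miss that we drop into alloc_iova()
and its rbtree lock:

unsigned long
alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
		unsigned long limit_pfn, bool flush_rcache)
{
	unsigned long iova_pfn;
	struct iova *new_iova;

	/* Fast path: per-CPU magazines, refilled from the global depot */
	iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);
	if (iova_pfn)
		return iova_pfn;

retry:
	/* Slow path: rbtree search/insert under the iova_rbtree_lock */
	new_iova = alloc_iova(iovad, size, limit_pfn, true);
	if (!new_iova) {
		unsigned int cpu;

		if (!flush_rcache)
			return 0;

		/* Last resort: flush everyone's cached IOVAs and retry once */
		flush_rcache = false;
		for_each_online_cpu(cpu)
			free_cpu_cached_iovas(cpu, iovad);
		goto retry;
	}

	return new_iova->pfn_lo;
}

If alloc_iova() dominates your profile, iova_rcache_get() must be
missing most of the time, which tends to come down to one of the two
cases below.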
If you're using small IOVA sizes which *should* be cached, then you
might be running into a pathological case of thrashing the global depot.
I've ranted before about the fixed MAX_GLOBAL_MAGS probably being too
small for systems with more than 16 CPUs, which on a modern AMD system I
imagine you may well have.
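For reference, the "depot" is a small fixed array of spare magazines per
size class - again a from-memory sketch of that era's iova.c, so the
exact numbers may be slightly off:

#define IOVA_MAG_SIZE		128
#define MAX_GLOBAL_MAGS		32	/* magazines per bin */

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

struct iova_cpu_rcache {
	spinlock_t lock;
	struct iova_magazine *loaded;	/* magazine we allocate from */
	struct iova_magazine *prev;	/* spare magazine */
};

/* One of these per cached size class */
struct iova_rcache {
	spinlock_t lock;
	unsigned long depot_size;
	struct iova_magazine *depot[MAX_GLOBAL_MAGS];
	struct iova_cpu_rcache __percpu *cpu_rcaches;
};

Once the depot slots are all full, a CPU which overflows its two local
magazines has to hand a whole magazine's worth of IOVAs back to the
rbtree under the lock, and the cycle repeats - which is the kind of
thrashing I'm talking about.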
If on the other hand your workload is making larger mappings above the
IOVA caching threshold, then please take a look at John's series for
making that tuneable:
https://lore.kernel.org/linux-iommu/1622557781-211697-1-git-send-email-john.garry@xxxxxxxxxx/
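For anyone unfamiliar, the threshold in question is the rcache size
limit - any mapping above it bypasses the caches entirely and always
takes the rbtree path. Roughly (same caveat, from memory):

/* log of max cached IOVA range size (in pages) */
#define IOVA_RANGE_CACHE_MAX_SIZE	6

static unsigned long iova_rcache_get(struct iova_domain *iovad,
				     unsigned long size,
				     unsigned long limit_pfn)
{
	unsigned int log_size = order_base_2(size);

	/* Anything bigger than 32 pages (128KB with 4KB pages) is uncached */
	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
		return 0;

	return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size);
}

That limit is what John's series makes adjustable.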
Cheers,
Robin.