On Mon, 2022-12-05 at 18:24 +0000, Robin Murphy wrote:
> On 2022-12-02 14:29, Niklas Schnelle wrote:
> > On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
> > > On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> > > > On 2022-11-29 12:00, Niklas Schnelle wrote:
> > > > > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > > > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > > > > When RPCIT indicates that the underlying hypervisor has run out of resources, it often means that its IOVA space is exhausted and IOVAs need to be freed before new ones can be created. By triggering a flush of the IOVA queue we can get the queued IOVAs freed and also get the new mapping established during the global flush.
> > > > > >
> > > > > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is exhausted and fail the DMA API call before even getting as far as iommu_map(), though? Or is there some less obvious limitation like a maximum total number of distinct IOVA regions regardless of size?
> > > > >
> > > > > Well, yes and no. Your thinking is of course correct: if the advertised IOVA space can be fully utilized without exhausting hypervisor resources, we won't trigger this case. Sadly, however, there are complications. The most obvious one is that in QEMU/KVM the restriction of the IOVA space to what QEMU can actually have mapped at once was only added recently [0]; prior to that we would regularly go through this "I'm out of resources, free me some IOVAs" dance with our existing DMA API implementation, where it just triggers an early cycle of freeing all unused IOVAs followed by a global flush. On z/VM I know of no situations where this is triggered. That said, this signalling is architected, so z/VM may have corner cases where it does this. On our bare metal hypervisor (no paging) this return code is unused, and IOTLB flushes are simply hardware cache flushes as on bare metal platforms.
> > > > >
> > > > > [0] https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@xxxxxxxxxxxxx/
> > > >
> > > > That sheds a bit more light, thanks, although I'm still not confident I fully understand the whole setup. AFAICS that patch looks to me like it's putting a fixed limit on the size of the usable address space. That in turn implies that "free some IOVAs and try again" might be a red herring and never going to work; for your current implementation, what that presumably means in reality is "free some IOVAs, resetting the allocator to start allocating lower down in the address space where it will happen to be below that limit, and try again", but the iommu-dma allocator won't do that. If it doesn't know that some arbitrary range below the top of the driver-advertised aperture is unusable, it will just keep allocating IOVAs up there and mappings will always fail.
> > > >
> > > > If the driver can't accurately represent the usable IOVA space via the aperture and/or reserved regions, then this whole approach seems doomed.
> > > > If on the other hand I've misunderstood and you can actually still use any address, just not all of them at the same time,
> > >
> > > This is exactly it: the problem is a limit on the number of IOVAs that are concurrently mapped. In QEMU pass-through the tightest limit is usually the one set by the host kernel parameter vfio_iommu_type1.dma_entry_limit, which defaults to 65535 mappings. With IOMMU_DOMAIN_DMA we stay under this limit without extra action, but once there is a flush queue (including the existing per-CPU one), where each entry may keep many pages lazily unmapped, this limit is easily hit with fio bandwidth tests on an NVMe. For this case this patch works reliably because, of course, the number of actually active mappings, not counting the lazily freed ones, is similar to the number of active mappings with IOMMU_DOMAIN_DMA.
> > >
> > > > then it might in fact be considerably easier to skip the flush queue mechanism entirely and implement this internally to the driver - basically make .iotlb_sync a no-op for non-strict DMA domains,
> > >
> > > I'm assuming you mean .iotlb_sync_map above.
>
> No, I did mean .iotlb_sync, however on reflection that was under the assumption that it's OK for the hypervisor to see a new mapping for a previously-used IOVA without having seen it explicitly unmapped in between. Fair enough if that isn't the case, but if it is then your pagetable can essentially act as the "flush queue" by itself.

Hmm, I see. We do want the hypervisor to see mappings as invalid before they are changed again to a new valid mapping. In the case of e.g. QEMU/KVM this allows the hypervisor to itself rely on the hardware to fill in the IOTLB on map, i.e. to use a no-op .iotlb_sync_map and a .iotlb_sync that flushes the hardware IOTLB. It also allows QEMU to emulate the IOMMU with just map/unmap primitives on vfio/iommufd. Consequently, I recently found that vfio-pci pass-through works in combination with an emulated s390 guest on an x86_64 host, albeit very slowly of course.

> > > > put the corresponding RPCIT flush and retry in .sync_map, then allow that to propagate an error back to iommu_map() if the new mapping still hasn't taken.
> > > >
> > > > Thanks,
> > > > Robin.
> > >
> > > Hmm, interesting. This would leave the IOVAs in the flush queue lazily unmapped and thus still block their re-use, but free their host resources via a global RPCIT, allowing the guest to use a different portion of the IOVA space with those resources. It could work, though I need to test it, but it feels a bit clunky.
> > >
> > > Maybe we can do this more cleanly while keeping the idea of not flushing the queue but just freeing the host-side resources. If we allowed .iotlb_sync_map to return an error that fails the mapping operation, then we could do it all in there. In the normal case it just does the RPCIT, but if that signals that the hypervisor ran out of resources, it does another, global RPCIT, allowing the hypervisor to free IOVAs that were lazily unmapped. If the latter succeeds all is good; if not, the mapping operation fails. Logically it makes sense too: .iotlb_sync_map is the final step of syncing the mapping to the host, which can fail just like the mapping operation itself.
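To make that a bit more concrete, the driver-side handler I have in mind would look roughly like the sketch below. This is only an illustration that assumes .iotlb_sync_map is allowed to return an int; s390_rpcit_range() and s390_rpcit_global() are made-up placeholders for the actual RPCIT helpers, not real code:

/* Sketch only, assuming .iotlb_sync_map may return an error. */
static int s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
				     unsigned long iova, size_t size)
{
	int rc;

	/* Normal case: shadow the new mapping into the hypervisor. */
	rc = s390_rpcit_range(domain, iova, size);
	if (rc != -ENOMEM)
		return rc;

	/*
	 * The hypervisor ran out of resources. A global RPCIT lets it drop
	 * the IOVAs we have lazily unmapped (still parked in the flush
	 * queue); then the range is retried once. If that still fails, the
	 * error is returned and the mapping operation as a whole fails.
	 */
	rc = s390_rpcit_global(domain);
	if (rc)
		return rc;

	return s390_rpcit_range(domain, iova, size);
}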
> > >
> > > Apart from the out-of-active-IOVAs case, this would also handle the other useful error case when using .iotlb_sync_map for shadowing, where it fails because the host ran into a pinned-pages limit or out of actual memory. Not by fixing it, but at least we would get a failed mapping operation.
> > >
> > > The other callbacks, .flush_iotlb_all and .iotlb_sync, could stay the same as they are only used for unmapped pages, where we can't reasonably run out of host resources, neither active IOVAs nor pinned pages.
> >
> > Ok, I've done some testing with the above idea and it seems to work great. I've verified that my version of QEMU (without Matt's IOVA aperture restriction patch) produces the RPCIT out-of-resource indications, and then the global flush in .iotlb_sync_map is triggered and allows QEMU to unpin pages and free IOVAs while the guest still has them lazily unmapped (sitting in the flush queue) and thus uses different IOVAs.
> >
> > @Robin @Joerg, if you are open to changing .iotlb_sync_map such that it can return an error and then fail the mapping operation, I think this is a great approach. One advantage over the previous approach of flushing the queue is that this should work for the pure IOMMU API too.
>
> Whatever happens I think allowing .iotlb_sync_map to propagate an error out through iommu_map() is an appropriate thing to do - it sounds like s390 might technically need that for regular IOMMU API correctness in some circumstances anyway. Besides, even in the cases where it represents "simple" TLB maintenance, there are potentially ways that could fail (like timing out if the IOMMU has gone completely wrong), so it wouldn't seem entirely unreasonable if a driver might want to report overall failure if it can't guarantee that the new mapping will actually be usable.
>
> Thanks,
> Robin.

Great, then I'll use this approach for v3, or maybe even send it separately as a prerequisite IOMMU fix, as this is indeed also required for the IOMMU API to work in a guest where the hypervisor may report running out of resources.
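For reference, the generic side of such a change could look roughly like the sketch below, based on how iommu_map() is structured today. It is only meant to illustrate letting .iotlb_sync_map return an error and is not the actual patch; the exact unwind path is of course up for discussion:

/* Sketch only: assumes ops->iotlb_sync_map is changed to return int. */
static int _iommu_map(struct iommu_domain *domain, unsigned long iova,
		      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
	const struct iommu_domain_ops *ops = domain->ops;
	int ret;

	ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
	if (ret)
		return ret;

	if (ops->iotlb_sync_map) {
		ret = ops->iotlb_sync_map(domain, iova, size);
		if (ret) {
			/*
			 * Syncing the new mapping to the host failed, e.g.
			 * because the hypervisor is still out of resources
			 * even after a global flush. Undo the mapping and
			 * report the error to the caller.
			 */
			iommu_unmap(domain, iova, size);
			return ret;
		}
	}

	return 0;
}

Callers of iommu_map() would then see such a failure just like any other mapping error.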