Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication

Robin Murphy <robin.murphy@xxxxxxx> · Mon, 5 Dec 2022 18:24:04 +0000

On 2022-12-02 14:29, Niklas Schnelle wrote:
On Tue, 2022-11-29 at 15:40 +0100, Niklas Schnelle wrote:
On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
On 2022-11-29 12:00, Niklas Schnelle wrote:
On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
On 2022-11-16 17:16, Niklas Schnelle wrote:
When RPCIT indicates that the underlying hypervisor has run out of
resources it often means that its IOVA space is exhausted and IOVAs need
to be freed before new ones can be created. By triggering a flush of the
IOVA queue we can get the queued IOVAs freed and also get the new
mapping established during the global flush.

Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
exhausted and fail the DMA API call before even getting as far as
iommu_map(), though? Or is there some less obvious limitation like a
maximum total number of distinct IOVA regions regardless of size?

Well, yes and no. Your thinking is of course correct if the advertised
available IOVA space can be fully utilized without exhausting
hypervisor resources we won't trigger this case. However sadly there
are complications. The most obvious being that in QEMU/KVM the
restriction of the IOVA space to what QEMU can actually have mapped at
once was just added recently[0] prior to that we would regularly go
through this "I'm out of resources free me some IOVAs" dance with our
existing DMA API implementation where this just triggers an early cycle
of freeing all unused IOVAs followed by a global flush. On z/VM I know
of no situations where this is triggered. That said this signalling is
architected so z/VM may have corner cases where it does this. On our
bare metal hypervisor (no paging) this return code is unused and IOTLB
flushes are simply hardware cache flushes as on bare metal platforms.

[0]
https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@xxxxxxxxxxxxx/

That sheds a bit more light, thanks, although I'm still not confident I
fully understand the whole setup. AFAICS that patch looks to me like
it's putting a fixed limit on the size of the usable address space. That
in turn implies that "free some IOVAs and try again" might be a red
herring and never going to work; for your current implementation, what
that presumably means in reality is "free some IOVAs, resetting the
allocator to start allocating lower down in the address space where it
will happen to be below that limit, and try again", but the iommu-dma
allocator won't do that. If it doesn't know that some arbitrary range
below the top of the driver-advertised aperture is unusable, it will
just keep allocating IOVAs up there and mappings will always fail.

If the driver can't accurately represent the usable IOVA space via the
aperture and/or reserved regions, then this whole approach seems doomed.
If on the other hand I've misunderstood and you can actually still use
any address, just not all of them at the same time,

This is exactly it, the problem is a limit on the number of IOVAs that
are concurrently mapped. In QEMU pass-through the tightest limit is
usually the one set by the host kernel parameter
vfio_iommu_type1.dma_entry_limit which defaults to 65535 mappings. With
IOMMU_DOMAIN_DMA we stay under this limit without extra action but once
there is a flush queue (including the existing per-CPU one) where each
entry may keep many pages lazily unmapped this is easly hit with fio
bandwidth tests on an NVMe. For this case this patch works reliably
because of course the number of actually active mappings without the
lazily freed ones is similar to the number of active ones with
IOMMU_DOMAIN_DMA.

  then it might in
fact be considerably easier to skip the flush queue mechanism entirely
and implement this internally to the driver - basically make .iotlb_sync
a no-op for non-strict DMA domains,

I'm assuming you mean .iotlb_sync_map above.

No, I did mean .iotlb_sync, however on reflection that was under the 
assumption that it's OK for the hypervisor to see a new mapping for a 
previously-used IOVA without having seen it explicitly unmapped in 
between. Fair enough if that isn't the case, but if it is then your 
pagetable can essentially act as the "flush queue" by itself.

  put the corresponding RPCIT flush
and retry in .sync_map, then allow that to propagate an error back to
iommu_map() if the new mapping still hasn't taken.

Thanks,
Robin.

Hmm, interesting. This would leave the IOVAs in the flush queue lazily
unmapped and thus still block their re-use but free their host
resources via a global RPCIT allowing the guest to use a different
porition of the IOVA space with those resources. It could work, though
I need to test it, but it feels a bit clunky.

Maybe we can go cleaner while using this idea of not having to flush
the queue but just freeing their host side resources. If we allowed
.iotlb_sync_map to return an error that fails the mapping operation,
then we could do it all in there. In the normal case it just does the
RPCIT but if that returns that the hypervisor ran out of resources it
does another global RPCIT allowing the hypervisor to free IOVAs that
were lazily unmapped. If the latter succeeds all is good if not then
the mapping operation failed. Logically it makes sense too,
.iotlb_sync_map is the final step of syncing the mapping to the host
which can fail just like the mapping operation itself.

Apart from the out of active IOVAs case this would also handle the
other useful error case when using .iotlb_sync_map for shadowing where
it fails because the host ran against a pinned pages limit or out of
actual memory. Not by fixing it but at least we would get a failed
mapping operation.

The other callbacks .flush_iotlb_all and .iotlb_sync
could stay the same as they are only used for unmapped pages where we
can't reasonably run out of resources in the host neither active IOVAs
nor pinned pages.

Ok, I've done some testing with the above idea and this seems to work
great. I've verified that my version of QEMU (without Matt's IOVA
aperture resrtriction patch) creates the RPCIT out of resource
indications and then the global flush in .iotlb_sync_map is triggered
and allows QEMU to unpin pages and free IOVAs while the guest still has
them lazily unpapeg (sitting in the flush queue) and thus uses
different IOVAs.

@Robin @Joerg, if you are open to changing .iotlb_sync_map such that it
can return and error and then failing the mapping operation I think
this is a great approach. One advantage over the previous approach of
flushing the queue isthat this should work for the pure IOMMU API too.

Whatever happens I think allowing .iotlb_sync_map to propagate an error 
out through iommu_map() is an appropriate thing to do - it sounds like 
s390 might technically need that for regular IOMMU API correctness in 
some circumstances anyway. Besides, even in the cases where it 
represents "simple" TLB maintenance, there are potentially ways that 
could fail (like timing out if the IOMMU has gone completely wrong), so 
it wouldn't seem entirely unreasonable if a driver might want to report 
overall failure if it can't guarantee that the new mapping will actually 
be usable.

Thanks,
Robin.

If you don't want to change the signature of .iotlb_sync_map I think we
can do Robin's idea and have .iotlb_sync_map as a no-op and do the
RPCIT sync as part of the s390_iommu_map_pages(). This would hide what
really is our variant of .iotlb_sync_map in the mapping code though
which I don't super like. Besides that it would also cause more RPCITs
in __iommu_map_sg() as we could no longer use a single RPCIT for the
entire range.

Thanks,
Niklas