On 2023-04-25 16:58, Jason Gunthorpe wrote:
On Tue, Apr 25, 2023 at 03:48:11PM +0100, Robin Murphy wrote:
On 2023-04-21 18:58, Jason Gunthorpe wrote:
On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:
I think a slightly more considered and slightly less wrong version of that
idea is to mark it as IOMMU_RESV_MSI, and special-case direct-mapping those
on Arm (I believe it would technically be benign to do on x86 too, but might
annoy people with its pointlessness). However...
I'd rather have a IOMMU_RESV_MSI_DIRECT and put the ARM special case
in ARM code..
Maybe, but it's still actually broken either way, because how do you get
that type into the VM? Firmware can't encode that a particular RMR
represents the special magic hack for IOMMUFD, so now the SMMU driver needs
to somehow be aware when it's running in a VM offering nested translation
and do some more magic to inject the appropriate region, and it's all
just... no.
Er, I figured ARM had sorted this out somehow :(
Eric, do you know anything about this? Where did you setup the 1:1 map
in the VM in your series?
So you are saying, the basic problem statement, is that the ACPI table
that describes the ITS direct mapping in the VM is supposed to be
interpreted by the SMMU driver as "memory must be iommu mapped 1:1 at
all times and is possibly dangerous enough to block userspace access
to the device, like Intel does" ?
In the case of IORT RMRs, all firmware can describe is regions of memory
which must be direct-mapped, along with some attributes for *how* they
are to be mapped. It cannot (and frankly should not) convey any
additional OS-specific special meaning.
This isn't end of the world bad, it just means that VFIO will not work
in ARM guests under this interrupt model. Sad, and something to fix,
but we can still cover alot of ground..
Maybe a GICv5 can correct it..
Correct what? If Linux chooses to use the existing GICv3 architecture in
an arse-backwards manner to opaquely emulate interrupts with other
interrupts instead of actually virtualising them, I'd say that's hardly
the architecture's fault. Given how much many of our partners do care
about virtualisation performance, I'm pretty confident the GIC/SMMU
architects aren't going to be spending time on future extensions for
making horribly inefficient interrupt emulation any easier to implement.
The primary purpose of the Interrupt Translation Service is to
*translate* interrupts. In combination with an SMMU, it allows a VM to
program a device with a virtualised address and EventID such that when
the device signals an interrupt by writing that EventID to that address,
the SMMU translates the address to direct the write to the appropriate
physical ITS, then the ITS translates the EventID into a physical LPI,
which may then be handled directly or stuffed back into the VM as the
appropriate virtual interrupt. The nature of these translations is such
that the host has no need to mediate access to the device itself.
Frankly, I think something whent wrong with the GICv4 design. A purely
virtualization focused GIC should not have continued to rely on the
hypervisor trapping of the MSI-X writes. The guest should have had a
real data value and a real physical ITS page.
A real *virtual* ITS page (IPA/GPA, not PA) as above, because the Arm
system architecture does not define a fixed address map thus that is
free to be virtualised as well, but otherwise, yes, indeed it should,
and it could, on both GICv3 and AMD/Intel IRQ remapping. Why isn't Linux
letting VMMs do that?
Fair enough for VFIO, since that existed long before any architectural
MSI support on Arm, and has crusty PCI/X legacy on x86 to deal with too,
but IOMMUFD is a new thing for modern use-cases on modern hardware. In
today's PCIe world why would we choose to focus on banging the
legacy-shaped peg into the modern-interrupt-remapping-shaped hole with
our foreheads, then complain that it hurts, rather than design the new
thing to use the modern functionality as intended, and have something
that can work really well for what users are actually going to want to do?
...I believe the remaining missing part is a UAPI for the VMM to ask the
host kernel to configure a "physical" vLPI for a given device and EventID,
at the point when its vITS emulation is handling the guest's configuration
command. With that we would no longer have to rewrite the MSI payload
either, so can avoid trapping the device's MSI-X capability at all, and the
VM could actually have non-terrible interrupt performance.
Yes.. More broadly I think we'd need to allow the vGIC code to
understand that it has complete control over a SID, and like we are
talking about for SMMU a vSID mapping as well.
Marc, any thoughts on how much of this is actually missing from the MSI
domain infrastructure today? I'm guessing the main thing is needing some
sort of msi_domain_alloc_virq() API so that the caller can provide the
predetermined message data to replace the normal compose/write flow -
beyond that I think I can see a tentative idea of how it could then be
plumbed through the generic msi_alloc_info and the ITS domain. The x86
irq_remapping domains scare me, though, so I'll say right now I'm not
going there :)
Thanks,
Robin.
This would have to replace the eventfd based hookup we have now.
I really want to avoid opening this can of worms because it is
basically iommufd all over again just irq focused :(
Except GCC says __builtin_ctzl(0) is undefined, so although I'd concur that
the chances of nasal demons at the point of invoking __ffs() are
realistically quite low, I don't fancy arguing that with the static checker
brigade. So by the time we've appeased them with additional checks,
initialisations, etc., we'd have basically the same overhead as running 0
iterations of another for loop (the overwhelmingly common case anyway), but
in more lines of code, with a more convoluted flow. All of which leads me to
conclude that "number of times we walk a usually-empty list in a one-off
slow path" is not in fact the most worthwhile thing to optimise for ;)
Heh, well fair enough, we do have a UBSAN that might trip on this. Lu
can correct it
Jason