Re: [RFC PATCH 00/21] Secure VFIO, TDISP, SEV TIO

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 29 Aug 2024 16:41:34 -0700

Alexey Kardashevskiy wrote:
[..]
> >> - skipping various enforcements of non-SME or
> >> SWIOTLB in the guest;
> > 
> > Is this based on some concept of private vs shared mode devices?
> > 
> >> No mixed share+private DMA supported within the
> >> same IOMMU.
> > 
> > What does this mean? A device may not have mixed mappings (makes sense),
> 
> Currently devices do not have an idea about private host memory (but it 
> is being worked on afaik).

Worked on where? You mean the PCI core indicating that a device is
private or not? Is that not indicated by guest-side TSM connection
state?

> > or an IOMMU can not host devices that do not all agree on whether DMA is
> > private or shared?
> 
> The hardware allows that via hardware-assisted vIOMMU and I/O page 
> tables in the guest with the C-bit taken into account by the IOMMU but the 
> software support is missing right now. So for this initial drop, vTOM is 
> used for DMA - this thing says "everything below <addr> is private, 
> above <addr> - shared" so nothing needs to bother with the C-bit, and in 
> my exercise I set the <addr> to the allowed maximum.
> 
> So each IOMMUFD instance in the VM is either "all private mappings" or 
> "all shared". Could be half/half by moving that <addr> :)

I thought existing use cases assume that the CC-VM can trigger page
conversions at will without regard to a vTOM concept? It would be nice
to have that address-map separation arrangement, but has that ship not
already sailed?
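
For reference, the vTOM split as described above boils down to a single
address comparison. A minimal sketch, not the SEV-SNP or IOMMU
implementation: `struct vtom_cfg` and `vtom_is_private()` are made-up
names, and treating the boundary address itself as shared is my
assumption since the description only says "below" and "above".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration: one boundary splits the address space. */
struct vtom_cfg {
	uint64_t vtom;	/* below: private; at or above: shared (assumed) */
};

/* Classify an address purely by comparing it with the boundary. */
static inline bool vtom_is_private(const struct vtom_cfg *cfg, uint64_t addr)
{
	return addr < cfg->vtom;
}
```

Moving `vtom` up to the allowed maximum makes every mapping private,
which is the "all private" IOMMUFD instance case; there is no
per-page C-bit lookup anywhere in this model, which is exactly why
at-will page conversions do not fit it.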

[..]
> > Would the device not just launch in "shared" mode until it is later
> > converted to private? I am missing the detail of why passing the device
> > on the command line requires that private memory be mapped early.
> 
> A sequencing problem.
> 
> QEMU "realizes" a VFIO device, it creates an iommufd instance which 
> creates a domain and writes to a DTE (an IOMMU descriptor for PCI BDFn). 
> And DTE is not updated after that. For secure stuff, DTE needs to be 
> slightly different. So right then I tell IOMMUFD that it will handle 
> private memory.
> 
> Then, the same VFIO "realize" handler maps the guest memory in iommufd. 
> I use the same flag (well, pointer to kvm) in the iommufd pinning code, 
> private memory is pinned and mapped (and related page state change 
> happens as the guest memory is made guest-owned in RMP).
> 
> QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual 
> place changed recently, huh) and the latter will measure the guest and 
> try making all guest memory private but it already happened => error.
> 
> I think I have to decouple the pinning and the IOMMU/DTE setting.
> 
> > That said, the implication that private device assignment requires
> > hotplug events is a useful property. This matches nicely with initial
> > thoughts that device conversion events are violent and might as well be
> > unplug/replug events to match all the assumptions around what needs to
> > be updated.
> 
> For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that 
> it is going to be private so there should be no massive conversion.

That's a SEV-TIO RFC-specific hack, or a proposal?

An approach that aligns more closely with the VFIO operational model,
where it maps and waits for guest faults / usages, is that QEMU would be
told that the device is "bind capable", because the host is not in a
position to assume how the guest will use the device. A "bind capable"
device operates in shared mode unless and until the guest triggers
private conversion.
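
To make the "bind capable" idea concrete, here is a rough state sketch.
All names (`enum tdi_mode`, `struct assigned_dev`,
`guest_request_private()`) are hypothetical and do not correspond to
existing QEMU, VFIO, or iommufd interfaces. The only point is that the
host records a capability and the guest alone drives the transition.

```c
#include <stdbool.h>
#include <errno.h>

enum tdi_mode { TDI_SHARED, TDI_PRIVATE };

/* Hypothetical per-device state as the host/VMM might track it. */
struct assigned_dev {
	bool bind_capable;	/* host-side hint: guest may bind later */
	enum tdi_mode mode;	/* current mode of the device */
};

/* Host side: every device launches in shared mode, capable or not. */
static inline void assigned_dev_init(struct assigned_dev *dev, bool bind_capable)
{
	dev->bind_capable = bind_capable;
	dev->mode = TDI_SHARED;
}

/* Guest-triggered conversion; the host never initiates this itself. */
static inline int guest_request_private(struct assigned_dev *dev)
{
	if (!dev->bind_capable)
		return -EOPNOTSUPP;
	if (dev->mode == TDI_PRIVATE)
		return -EALREADY;
	dev->mode = TDI_PRIVATE;
	return 0;
}
```

In this model nothing needs to be private at "realize" time, so the
early pinning vs. launch-update ordering problem does not come up
until the guest actually asks for conversion.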

> >> This requires the BME hack as MMIO and
> > 
> > Not sure what the "BME hack" is, I guess this is foreshadowing for later
> > in this story.
> > 
> >> BusMaster enable bits cannot be 0 after MMIO
> >> validation is done
> > 
> > It would be useful to call out what is a TDISP requirement, vs
> > device-specific DSM vs host-specific TSM requirement. In this case I
> > assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must
> 
> Oh there is 6.2 already.
> 
> > enter the TDISP ERROR state if BME is cleared after the device is
> > locked?
> > 
> > ...but this begs the question of whether it needs to be avoided outright
> 
> Well, besides a couple of avoidable places (like testing INTx support 
> which we know is not going to work on VFs anyway), a standard driver 
> enables MSE first (and the value for the command register does not have 
> 1 for BME) and only then BME. TBH I do not think writing BME=0 when 
> BME=0 already is "clearing" but my test device disagrees.

...but we should not be creating kernel policy around test devices. What
matters is real devices. Now, if it is likely that real / production
devices will go into the TDISP ERROR state by not coalescing MSE + BME
updates then we need a solution.

Given it is unlikely that TDISP support will be widespread any time soon
it is likely tenable to assume TDISP compatible drivers call a new:

   pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);

...or something like that to coalesce command register writes.

Otherwise if that retrofit ends up being too much work or confusion then
the ROI of teaching the PCI core to recover from this scenario needs to be
evaluated.
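
As a rough userspace model of what coalescing would buy (not kernel
code): `PCI_COMMAND_MEMORY` and `PCI_COMMAND_MASTER` use the values
from pci_regs.h, everything else (`struct fake_tdi`, `cmd_write()`,
the `PCI_ENABLE_*` flag values, `enable_coalesced()`) is hypothetical.
The model deliberately adopts the strict behavior of the test device
mentioned above, where any command register write with BME=0 after
lock counts as clearing BME. Whether production devices behave that
way is the open question.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY	0x2	/* MSE, value as in pci_regs.h */
#define PCI_COMMAND_MASTER	0x4	/* BME, value as in pci_regs.h */

#define PCI_ENABLE_TARGET	0x1	/* hypothetical flag */
#define PCI_ENABLE_INITIATOR	0x2	/* hypothetical flag */

/* Hypothetical model of a locked TDI's command register. */
struct fake_tdi {
	uint16_t command;
	bool locked;
	bool error;	/* set once the model enters TDISP ERROR */
};

/* Strict model: a write with BME=0 while locked is treated as a clear. */
static inline void cmd_write(struct fake_tdi *tdi, uint16_t val)
{
	if (tdi->locked && !(val & PCI_COMMAND_MASTER))
		tdi->error = true;
	tdi->command = val;
}

/* Classic driver sequence: MSE first (BME still 0), then BME. */
static inline void enable_two_step(struct fake_tdi *tdi)
{
	cmd_write(tdi, tdi->command | PCI_COMMAND_MEMORY);
	cmd_write(tdi, tdi->command | PCI_COMMAND_MASTER);
}

/* Coalesced variant: build the final value, then write it once. */
static inline void enable_coalesced(struct fake_tdi *tdi, unsigned int flags)
{
	uint16_t val = tdi->command;

	if (flags & PCI_ENABLE_TARGET)
		val |= PCI_COMMAND_MEMORY;
	if (flags & PCI_ENABLE_INITIATOR)
		val |= PCI_COMMAND_MASTER;
	cmd_write(tdi, val);
}
```

With a locked device starting from command == 0, enable_two_step()
trips the error on its first write, while enable_coalesced() with both
flags never presents BME=0 to the device.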

> > or handled as an error recovery case depending on policy.
> 
> Avoiding seems more straightforward unless we actually want enlightened 
> device drivers which want to examine the interface report before 
> enabling the device. Not sure.

If TDISP capable devices trend towards a handful of devices in the near
term then some driver fixups seem reasonable. Otherwise if every PCI
device driver Linux has ever seen needs to be ready for that device to
have a TDISP capable flavor then mitigating this in the PCI core makes
more sense than playing driver whack-a-mole.

> >> the guest OS booting process when this
> >> happens.
> >>
> >> SVSM could help addressing these (not
> >> implemented at the moment).
> > 
> > At first thought, avoiding SVSM entanglements where the kernel can be
> > enlightened should be the policy. I would only expect SVSM hacks to cover
> > for legacy OSes that will never be TDISP enlightened, but in that case
> > we are likely talking about fully unaware L2. Let's assume fully
> > enlightened L1 for now.
> 
> Well, I could also tweak OVMF to make necessary calls to the PSP and 
> hack QEMU to postpone the command register updates to get this going, 
> just a matter of ugliness.

Per above, the tradeoff should be in ROI, not ugliness. I don't see how
OVMF helps when devices may be virtually hotplugged or reset.