Re: [RFC PATCH 00/21] Secure VFIO, TDISP, SEV TIO

Alexey Kardashevskiy <aik@xxxxxxx> · Fri, 30 Aug 2024 14:38:21 +1000

On 30/8/24 09:41, Dan Williams wrote:
Alexey Kardashevskiy wrote:
[..]
- skipping various enforcements of non-SME or
SWIOTLB in the guest;

Is this based on some concept of private vs shared mode devices?

No mixed share+private DMA supported within the
same IOMMU.

What does this mean? A device may not have mixed mappings (makes sense),

Currently devices do not have an idea about private host memory (but it
is being worked on afaik).

Worked on where? You mean the PCI core indicating that a device is
private or not? Is that not indicated by guest-side TSM connection
state?
or an IOMMU can not host devices that do not all agree on whether DMA is
private or shared?

The hardware allows that via hardware-assisted vIOMMU and I/O page
tables in the guest with the C-bit taken into account by the IOMMU but the
software support is missing right now. So for this initial drop, vTOM is
used for DMA - this thing says "everything below <addr> is private,
above <addr> - shared" so nothing needs to bother with the C-bit, and in
my exercise I set the <addr> to the allowed maximum.

So each IOMMUFD instance in the VM is either "all private mappings" or
"all shared". Could be half/half by moving that <addr> :)
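
A minimal sketch of the vTOM rule described above, assuming a single
boundary address per IOMMU instance. The struct and function names are
hypothetical and are not taken from the patch series; how an address
exactly equal to the boundary is treated is also an assumption here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a vTOM-style split; not the real IOMMU programming. */
struct vtom_cfg {
    uint64_t vtom; /* DMA addresses below this value are treated as private */
};

/*
 * Returns true if a DMA address would be treated as private, i.e. as if
 * the C-bit were set, without consulting any per-page C-bit.
 * Assumption: an address equal to the boundary counts as shared.
 */
static bool vtom_is_private(const struct vtom_cfg *cfg, uint64_t addr)
{
    return addr < cfg->vtom;
}
```

Setting `vtom` to the allowed maximum corresponds to the "all private
mappings" case; moving it lower gives the half/half arrangement.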

I thought existing use cases assume that the CC-VM can trigger page
conversions at will without regard to a vTOM concept? It would be nice
to have that address-map separation arrangement, has not that ship
already sailed?

Mmm. I am either confusing you too much or not following you :) Any page
can be converted; would the proposed arrangement require that
conversion-candidate pages are allocated from a specific pool?

There are two vTOMs - one in IOMMU to decide on C-bit for DMA traffic (I
use this one), one in VMSA ("VIRTUAL_TOM") for guest memory (this 
exercise is not using it). Which one do you mean?

[..]
Would the device not just launch in "shared" mode until it is later
converted to private? I am missing the detail of why passing the device
on the command line requires that private memory be mapped early.

A sequencing problem.

QEMU "realizes" a VFIO device, it creates an iommufd instance which
creates a domain and writes to a DTE (an IOMMU descriptor for PCI BDFn).
And DTE is not updated after that. For secure stuff, DTE needs to be
slightly different. So right then I tell IOMMUFD that it will handle
private memory.

Then, the same VFIO "realize" handler maps the guest memory in iommufd.
I use the same flag (well, pointer to kvm) in the iommufd pinning code,
private memory is pinned and mapped (and related page state change
happens as the guest memory is made guest-owned in RMP).

QEMU goes to machine_reset() and calls "SNP LAUNCH UPDATE" (the actual
place changed recently, huh) and the latter will measure the guest and
try making all guest memory private but it already happened => error.

I think I have to decouple the pinning and the IOMMU/DTE setting.

That said, the implication that private device assignment requires
hotplug events is a useful property. This matches nicely with initial
thoughts that device conversion events are violent and might as well be
unplug/replug events to match all the assumptions around what needs to
be updated.

For the initial drop, I tell QEMU via "-device vfio-pci,x-tio=true" that
it is going to be private so there should be no massive conversion.

That's a SEV-TIO RFC-specific hack, or a proposal?

Not sure at the moment :)

An approach that aligns more closely with the VFIO operational model,
where it maps and waits for guest faults / usages, is that QEMU would be
told that the device is "bind capable", because the host is not in a
position to assume how the guest will use the device. A "bind capable"
device operates in shared mode unless and until the guest triggers
private conversion.

True. I just started this exercise without QEMU DiscardManager. Now I 
rely on it but it either needs to allow dynamic flip from 
discarded==private to discarded==shared (should do for now) or allow 3
states for guest pages.

This requires the BME hack as MMIO and

Not sure what the "BME hack" is, I guess this is foreshadowing for later
in this story.

BusMaster enable bits cannot be 0 after MMIO
validation is done

It would be useful to call out what is a TDISP requirement, vs
device-specific DSM vs host-specific TSM requirement. In this case I
assume you are referring to PCI 6.2 11.2.6 where it notes that TDIs must

Oh there is 6.2 already.

enter the TDISP ERROR state if BME is cleared after the device is
locked?

...but this begs the question of whether it needs to be avoided outright

Well, besides a couple of avoidable places (like testing INTx support
which we know is not going to work on VFs anyway), a standard driver
enables MSE first (and the value for the command register does not have
1 for BME) and only then BME. TBH I do not think writing BME=0 when
BME=0 already is "clearing" but my test device disagrees.

...but we should not be creating kernel policy around test devices. What
matters is real devices. Now, if it is likely that real / production
devices will go into the TDISP ERROR state by not coalescing MSE + BME
updates then we need a solution.

True but I do not even know who to ask this question :)

Given it is unlikely that TDISP support will be widespread any time soon
it is likely tenable to assume TDISP compatible drivers call a new:

    pci_enable(pdev, PCI_ENABLE_TARGET | PCI_ENABLE_INITIATOR);

...or something like that to coalesce command register writes.

Otherwise if that retrofit ends up being too much work or confusion then
the ROI of teaching the PCI core to recover this scenario needs to be
evaluated.

Agree.
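
To make the MSE/BME ordering issue concrete, here is a toy model, a
sketch only. It assumes a locked TDI that behaves like the test device
mentioned above, i.e. it treats any command register write with BME=0
as "clearing" BME. The `toy_*` names are hypothetical; only the bit
positions match the PCI command register (Memory Space Enable is bit 1,
Bus Master Enable is bit 2).

```c
#include <stdbool.h>
#include <stdint.h>

#define CMD_MEMORY 0x2 /* MSE, same bit as PCI_COMMAND_MEMORY */
#define CMD_MASTER 0x4 /* BME, same bit as PCI_COMMAND_MASTER */

/* Toy model of a TDI; not a real device or kernel structure. */
struct toy_tdi {
    uint16_t command; /* current command register value */
    bool locked;      /* interface has been locked */
    bool error;       /* set once the model enters its error state */
};

/* Assumed behaviour: once locked, any write with BME=0 is an error. */
static void toy_write_command(struct toy_tdi *t, uint16_t val)
{
    if (t->locked && !(val & CMD_MASTER))
        t->error = true;
    t->command = val;
}

/* Usual driver order: MSE first (BME still 0), then BME. */
static void toy_enable_two_step(struct toy_tdi *t)
{
    toy_write_command(t, t->command | CMD_MEMORY);
    toy_write_command(t, t->command | CMD_MASTER);
}

/* Coalesced variant: MSE and BME set by a single write. */
static void toy_enable_coalesced(struct toy_tdi *t)
{
    toy_write_command(t, t->command | CMD_MEMORY | CMD_MASTER);
}
```

In this model the two-step sequence trips the error on its first write,
while the coalesced write does not, which is the property a
pci_enable()-style helper would rely on. Whether production devices
behave this way is exactly the open question.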

or handled as an error recovery case depending on policy.

Avoiding seems more straightforward unless we actually want enlightened
device drivers which want to examine the interface report before
enabling the device. Not sure.

If TDISP capable devices trend towards a handful of devices in the near
term then some driver fixups seem reasonable. Otherwise if every PCI
device driver Linux has ever seen needs to be ready for that device to
have a TDISP capable flavor then mitigating this in the PCI core makes
more sense than playing driver whack-a-mole.

the guest OS booting process when this
happens.

SVSM could help addressing these (not
implemented at the moment).

At first though avoiding SVSM entanglements where the kernel can be
enlightened should be the policy. I would only expect SVSM hacks to cover
for legacy OSes that will never be TDISP enlightened, but in that case
we are likely talking about fully unaware L2. Let's assume fully
enlightened L1 for now.

Well, I could also tweak OVMF to make necessary calls to the PSP and
hack QEMU to postpone the command register updates to get this going,
just a matter of ugliness.

Per above, the tradeoff should be in ROI, not ugliness. I don't see how
OVMF helps when devices might be being virtually hotplugged or reset.

I have no clue how exactly hotplug works on x86, is the BIOS not playing a
role in it? Thanks,

--
Alexey