On Wed, Mar 20, 2024 at 05:26:00PM +0100, Paolo Bonzini wrote:
> On 3/20/24 09:39, Michael Roth wrote:
> > Some subsystems like VFIO might disable ram block discard for
> > uncoordinated cases. Since kvm_convert_memory()/guest_memfd don't
> > implement a RamDiscardManager handler to convey discard operations to
> > various listeners like VFIO, sequences like the following can result
> > due to stale IOMMU mappings:
> 
> Alternatively, should guest-memfd memory regions call
> ram_block_discard_require(true)? This will prevent VFIO from operating,
> but it will avoid consuming twice the memory.
> 
> If desirable, guest-memfd support can be changed to implement an
> extension of RamDiscardManager that notifies about private/shared
> memory changes, and then guest-memfd would be able to support
> coordinated discard. But I wonder

In an earlier/internal version of the SNP+gmem patches (when there was
still a dedicated hostmem-memfd-private backend for restrictedmem/gmem),
we had a rough implementation of RamDiscardManager that did this:

  https://github.com/AMDESE/qemu/blob/snp-latest-gmem-v12/backends/hostmem-memfd-private.c#L75

Now that gmem handling is mostly done transparently to whatever HostMem
backend is in use, I'm not sure what the right place would be to
implement something similar, but maybe it can be done in a more generic
way -- something along the lines of the sketch below.
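For reference, the general shape would be something like this (untested
sketch from memory, not the actual code from that branch: it treats
"populated" as shared and "discarded" as private, so VFIO's listener
unmaps on shared->private and re-maps on private->shared; the
GuestMemfdRdm type, its rdl_list, and the
memory_region_section_intersect_range() helper are stand-ins, the
latter modeled on virtio-mem's private intersect helper):

    static int guest_memfd_rdm_notify(GuestMemfdRdm *grdm, uint64_t offset,
                                      uint64_t size, bool to_private)
    {
        RamDiscardListener *rdl;
        int ret = 0;

        QLIST_FOREACH(rdl, &grdm->rdl_list, next) {
            MemoryRegionSection tmp = *rdl->section;

            /* clip the listener's section to the range being converted */
            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                continue;
            }
            if (to_private) {
                /* shared->private: listeners (e.g. VFIO) unmap the range */
                rdl->notify_discard(rdl, &tmp);
            } else {
                /* private->shared: listeners re-establish DMA mappings */
                ret = rdl->notify_populate(rdl, &tmp);
                if (ret) {
                    break;
                }
            }
        }
        return ret;
    }

kvm_convert_memory() (or wherever the conversion ends up being driven
from) would call that as part of the flip, so VFIO keeps its IOMMU
mappings in sync instead of being left with stale ones.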
There were some notable downsides to that approach, though, which I'm a
little hazy on now; I think they were both kernel limitations:

 - VFIO seemed to have some limitation where it expects that the DMA
   mapping for a particular iova will be unmapped/mapped with the same
   granularity, but for an SNP guest there's no guarantee that if you
   flip a 2MB page from shared->private, it won't later be flipped
   private->shared again, this time with a 4K granularity/sub-range. I
   think the current code still treats this as an -EINVAL case. So we
   end up needing to do everything with 4K granularity, which I *think*
   results in 4K IOMMU page table mappings, but I'd need to confirm.

 - VFIO doesn't seem to be optimized for this sort of use case and
   generally expects a much larger granularity; it defaults to a limit
   of 64K DMA entries, so a 16GB guest mapped at 4K granularity
   (16GiB / 4KiB = 4194304 mappings) needs VFIO configured with
   something like:

     vfio_iommu_type1.dma_entry_limit=4194304

   I didn't see any reason to suggest that's problematic, but it makes
   me wonder if there's other stuff we might run into.

> if that's doable at all - how common are shared<->private flips, and
> is it feasible to change the IOMMU page tables every time?

 - For OVMF+guest kernels that don't do lazy-acceptance: I think the
   bulk of the flipping happens during boot, where most of the shared
   GPA ranges get converted to private memory, and then later on the
   guest kernel switches memory back to shared for stuff like SWIOTLB.
   After that, I think DMA mappings would be fairly stable.

 - For OVMF+guest kernels that support lazy-acceptance: the first 4GB
   get converted to private, and the rest remains shared until the
   guest kernel needs to allocate memory from it. I'm not sure if
   SWIOTLB allocation is optimized to avoid unnecessary flipping when
   it's allocated from that pool of still-shared memory, but
   normal/private allocations will result in a steady stream of DMA
   unmap operations as the guest faults in its working set.

> If the real solution is SEV-TIO (which means essentially guest_memfd
> support for VFIO), calling ram_block_discard_require(true) may be the
> simplest stopgap solution.

It's hard to guess how cloud vendors will feel about waiting for
trusted I/O. It does make sense in the context of CoCo to expect them
to wait, but it would be nice to have a stop-gap to offer, like
disabling discard, since that has minimal requirements on the QEMU/VFIO
side and might be enough to get early adopters up and running at least.
(I've sketched the ram_block_discard_require() alternative in a P.S.
below the quoted patch, for comparison.)

All that said, if you think something based around RamDiscardManager
seems tenable given all of the above, then we can re-visit that
approach as well.

-Mike

> 
> Paolo
> 
> > - convert page shared->private
> > - discard shared page
> > - convert page private->shared
> > - new page is allocated
> > - issue DMA operations against that shared page
> > 
> > Address this by taking ram_block_discard_is_disabled() into account
> > when deciding whether or not to discard pages.
> > 
> > Signed-off-by: Michael Roth <michael.roth@xxxxxxx>
> > ---
> >  accel/kvm/kvm-all.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 53ce4f091e..6ae03c880f 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -2962,10 +2962,14 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> >               */
> >              return 0;
> >          } else {
> > -            ret = ram_block_discard_range(rb, offset, size);
> > +            ret = !ram_block_discard_is_disabled()
> > +                  ? ram_block_discard_range(rb, offset, size)
> > +                  : 0;
> >          }
> >      } else {
> > -        ret = ram_block_discard_guest_memfd_range(rb, offset, size);
> > +        ret = !ram_block_discard_is_disabled()
> > +              ? ram_block_discard_guest_memfd_range(rb, offset, size)
> > +              : 0;
> >      }
> >  } else {
> >      error_report("Convert non guest_memfd backed memory region "
> 
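P.S. Re: the ram_block_discard_require(true) option, for comparison I'd
imagine it looking something like the following (untested sketch: I'm
guessing at the hook point, e.g. somewhere in ram_block_add() once the
guest_memfd has been created for the RAMBlock, and the cleanup path is
just a placeholder):

    /*
     * Untested sketch: have guest_memfd-backed RAMBlocks require that
     * (uncoordinated) discard stays available. This keeps the discard
     * path working, so we don't consume shared+private memory twice,
     * at the cost of failing up front when something like VFIO has
     * already called ram_block_discard_disable(true).
     */
    if (kvm_enabled() && new_block->guest_memfd >= 0 &&
        ram_block_discard_require(true) < 0) {
        error_setg(errp, "cannot set up private guest memory: discard "
                   "is currently disabled (e.g. by a VFIO device)");
        return; /* placeholder: real code needs to unwind the RAMBlock */
    }

That would at least make the conflict with VFIO an explicit startup
failure instead of leaving stale IOMMU mappings around at runtime.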