On Wed, Mar 20, 2024 at 05:26:00PM +0100, Paolo Bonzini wrote:
> On 3/20/24 09:39, Michael Roth wrote:
> > Some subsystems like VFIO might disable ram block discard for
> > uncoordinated cases. Since kvm_convert_memory()/guest_memfd don't
> > implement a RamDiscardManager handler to convey discard operations to
> > various listeners like VFIO, sequences like the following can result
> > due to stale IOMMU mappings:
> 
> Alternatively, should guest-memfd memory regions call
> ram_block_discard_require(true)? This will prevent VFIO from operating,
> but it will avoid consuming twice the memory.
> 
> If desirable, guest-memfd support can be changed to implement an
> extension of RamDiscardManager that notifies about private/shared
> memory changes, and then guest-memfd would be able to support
> coordinated discard. But I wonder

In an earlier/internal version of the SNP+gmem patches (when there was
still a dedicated hostmem-memfd-private backend for restrictedmem/gmem),
we had a rough implementation of RamDiscardManager that did this:

  https://github.com/AMDESE/qemu/blob/snp-latest-gmem-v12/backends/hostmem-memfd-private.c#L75

Now that gmem handling is mostly done transparently to whatever HostMem
backend is in use, I'm not sure what the right place would be to
implement something similar, but maybe it can be done in a more generic
way -- something along the lines of the sketch below.
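For reference, the general shape would be something like this (untested
sketch from memory, not the actual code from that branch: it treats
"populated" as shared and "discarded" as private, so VFIO's listener
unmaps on shared->private and re-maps on private->shared; the
GuestMemfdRdm type, its rdl_list, and the
memory_region_section_intersect_range() helper are stand-ins, the
latter modeled on virtio-mem's private intersect helper):

    static int guest_memfd_rdm_notify(GuestMemfdRdm *grdm, uint64_t offset,
                                      uint64_t size, bool to_private)
    {
        RamDiscardListener *rdl;
        int ret = 0;

        QLIST_FOREACH(rdl, &grdm->rdl_list, next) {
            MemoryRegionSection tmp = *rdl->section;

            /* clip the listener's section to the range being converted */
            if (!memory_region_section_intersect_range(&tmp, offset, size)) {
                continue;
            }
            if (to_private) {
                /* shared->private: listeners (e.g. VFIO) unmap the range */
                rdl->notify_discard(rdl, &tmp);
            } else {
                /* private->shared: listeners re-establish DMA mappings */
                ret = rdl->notify_populate(rdl, &tmp);
                if (ret) {
                    break;
                }
            }
        }
        return ret;
    }

kvm_convert_memory() (or wherever the conversion ends up being driven
from) would call that as part of the flip, so VFIO keeps its IOMMU
mappings in sync instead of being left with stale ones.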
There were some notable downsides to that approach, though, which I'm a
little hazy on now; I think they were both kernel limitations:

 - VFIO seemed to have some limitation where it expects that the DMA
   mapping for a particular iova will be unmapped/mapped with the same
   granularity, but for an SNP guest there's no guarantee that if you
   flip a 2MB page from shared->private, it won't later be flipped
   private->shared again, this time with a 4K granularity/sub-range. I
   think the current code still treats this as an -EINVAL case. So we
   end up needing to do everything with 4K granularity, which I *think*
   results in 4K IOMMU page table mappings, but I'd need to confirm.

 - VFIO doesn't seem to be optimized for this sort of use case and
   generally expects a much larger granularity; it defaults to a limit
   of 64K DMA entries, so a 16GB guest mapped at 4K granularity
   (16GiB / 4KiB = 4194304 mappings) needs VFIO configured with
   something like:

     vfio_iommu_type1.dma_entry_limit=4194304

   I didn't see any reason to suggest that's problematic, but it makes
   me wonder if there's other stuff we might run into.

> if that's doable at all - how common are shared<->private flips, and
> is it feasible to change the IOMMU page tables every time?

 - For OVMF+guest kernels that don't do lazy-acceptance: I think the
   bulk of the flipping happens during boot, where most of the shared
   GPA ranges get converted to private memory, and then later on the
   guest kernel switches memory back to shared for stuff like SWIOTLB.
   After that, I think DMA mappings would be fairly stable.

 - For OVMF+guest kernels that support lazy-acceptance: the first 4GB
   get converted to private, and the rest remains shared until the
   guest kernel needs to allocate memory from it. I'm not sure if
   SWIOTLB allocation is optimized to avoid unnecessary flipping when
   it's allocated from that pool of still-shared memory, but
   normal/private allocations will result in a steady stream of DMA
   unmap operations as the guest faults in its working set.

> If the real solution is SEV-TIO (which means essentially guest_memfd
> support for VFIO), calling ram_block_discard_require(true) may be the
> simplest stopgap solution.

It's hard to guess how cloud vendors will feel about waiting for
trusted I/O. It does make sense in the context of CoCo to expect them
to wait, but it would be nice to have a stop-gap to offer, like
disabling discard, since that has minimal requirements on the QEMU/VFIO
side and might be enough to get early adopters up and running at least.
(I've sketched the ram_block_discard_require() alternative in a P.S.
below the quoted patch, for comparison.)

All that said, if you think something based around RamDiscardManager
seems tenable given all of the above, then we can re-visit that
approach as well.

-Mike

> 
> Paolo
> 
> > - convert page shared->private
> > - discard shared page
> > - convert page private->shared
> > - new page is allocated
> > - issue DMA operations against that shared page
> > 
> > Address this by taking ram_block_discard_is_disabled() into account
> > when deciding whether or not to discard pages.
> > 
> > Signed-off-by: Michael Roth <michael.roth@xxxxxxx>
> > ---
> >  accel/kvm/kvm-all.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 53ce4f091e..6ae03c880f 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -2962,10 +2962,14 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> >               */
> >              return 0;
> >          } else {
> > -            ret = ram_block_discard_range(rb, offset, size);
> > +            ret = !ram_block_discard_is_disabled()
> > +                  ? ram_block_discard_range(rb, offset, size)
> > +                  : 0;
> >          }
> >      } else {
> > -        ret = ram_block_discard_guest_memfd_range(rb, offset, size);
> > +        ret = !ram_block_discard_is_disabled()
> > +              ? ram_block_discard_guest_memfd_range(rb, offset, size)
> > +              : 0;
> >      }
> >  } else {
> >      error_report("Convert non guest_memfd backed memory region "
> 
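P.S. Re: the ram_block_discard_require(true) option, for comparison I'd
imagine it looking something like the following (untested sketch: I'm
guessing at the hook point, e.g. somewhere in ram_block_add() once the
guest_memfd has been created for the RAMBlock, and the cleanup path is
just a placeholder):

    /*
     * Untested sketch: have guest_memfd-backed RAMBlocks require that
     * (uncoordinated) discard stays available. This keeps the discard
     * path working, so we don't consume shared+private memory twice,
     * at the cost of failing up front when something like VFIO has
     * already called ram_block_discard_disable(true).
     */
    if (kvm_enabled() && new_block->guest_memfd >= 0 &&
        ram_block_discard_require(true) < 0) {
        error_setg(errp, "cannot set up private guest memory: discard "
                   "is currently disabled (e.g. by a VFIO device)");
        return; /* placeholder: real code needs to unwind the RAMBlock */
    }

That would at least make the conflict with VFIO an explicit startup
failure instead of leaving stale IOMMU mappings around at runtime.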