Re: [PATCH v3 0/4] Balloon inhibit enhancements, vfio restriction

On Wed, 8 Aug 2018 00:58:32 +0300
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

> On Tue, Aug 07, 2018 at 01:53:03PM -0600, Alex Williamson wrote:
> > On Tue, 7 Aug 2018 22:44:56 +0300
> > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> >   
> > > On Tue, Aug 07, 2018 at 01:31:21PM -0600, Alex Williamson wrote:  
> > > > v3:
> > > >  - Drop "nested" term in commit log (David)
> > > >  - Adopt suggested wording in ccw code (Cornelia)
> > > >  - Explain balloon inhibitor usage in vfio common (Peter)
> > > >  - Fix to call inhibitor prior to re-using existing containers
> > > >    to avoid gap that pinning may have occurred in set container
> > > >    ioctl (self) - Peter, this change is the reason I didn't
> > > >    include your R-b.
> > > >  - Add R-b to patches 1 & 2
> > > > 
> > > > v2:
> > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > >    devices, no support added for platform as there are no platform
> > > >    mdev devices.
> > > > 
> > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > assignment typically don't mix.  If this eventually changes, flags
> > > > on the iommu info struct or perhaps device info struct can inform
> > > > us for automatic opt-in.  Thanks,
> > > > 
> > > > Alex    
> > > 
> > > One of the issues with pass-through is that it breaks overcommit
> > > through swap. Ballooning seems to offer one solution; instead of
> > > making it work, this patch just attempts to block ballooning.
> > > 
> > > I guess it's better than corrupting memory but I personally find this
> > > approach disappointing.  
> > 
> > Memory hotplug is the way to achieve variable density with assigned
> > device VMs, otherwise look towards approaches like mdev and shared
> > virtual addresses with PASID support.  We cannot shoehorn page faulting
> > without both hardware and software support.  Some class of "legacy"
> > device assignment will always have this incompatibility.  Thanks,
> > 
> > Alex  
> 
> I'm not sure I agree.
> 
> At least with VTD, it seems entirely possible to change e.g. a PMD
> atomically to point to a different set of PTEs, then flush.
> That will allow removing memory at high granularity for
> an arbitrary device without mdev or PASID dependency.
> 
> I suspect most IOMMUs are like this.
> 
> IIUC doing that within guest right now will cause a range to be unmapped
> and then mapped again, which I suspect only works if we are lucky and
> device does not access the range during this time.
> 
> So at some level it's a theoretical bug we would do well to fix,
> and then we can support ballooning better.

Being able to unmap the page atomically from the IOMMU is one aspect;
the other is re-mapping the page when the balloon is deflated, which is
currently done only via a page fault.  We cannot guarantee that a vCPU
will touch a page before the IO device does, so something needs to
fault in that page for the IOMMU.  So we have:

 - How do we handle re-mapping pages as the balloon is deflated?
   - IOMMU page faults?  Requires PRI, IOMMU & endpoint support.
   - Some new MMU notifier hook?  Not sure WILLNEED is appropriate here.

 - How do we handle un-mapping pages as the balloon is inflated?
   - Rewrite the kernel IOMMU API and IOMMU drivers to allow unmapping
     sub-pages within previous mappings.
   - MMU notifier hook to trigger above non-existent code?
   - Alternatively, sacrificing IOTLB performance and probably kernel
     bloat by using only PAGE_SIZE IOMMU mappings.
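
To illustrate the un-mapping trade-off in the list above, here is a
minimal toy model in C.  Everything in it (toy_domain, toy_map,
toy_unmap, toy_map_per_page, the 4 KiB page size, the table size) is
hypothetical and is not the kernel IOMMU API; it only assumes, as the
list describes, that an unmap cannot carve a sub-page range out of a
larger previous mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE    4096u   /* assumed base page size */
#define TOY_MAX_MAPPINGS 1024    /* arbitrary table size for the sketch */

/* Hypothetical stand-in for one IOMMU mapping: [iova, iova + size). */
struct toy_mapping {
    uint64_t iova;
    uint64_t size;
    bool used;
};

struct toy_domain {
    struct toy_mapping maps[TOY_MAX_MAPPINGS];
};

/* Record a single mapping. Returns 0 on success, -1 if the table is full. */
static int toy_map(struct toy_domain *d, uint64_t iova, uint64_t size)
{
    assert(size > 0 && size % TOY_PAGE_SIZE == 0);
    for (size_t i = 0; i < TOY_MAX_MAPPINGS; i++) {
        if (!d->maps[i].used) {
            d->maps[i] = (struct toy_mapping){ iova, size, true };
            return 0;
        }
    }
    return -1;
}

/*
 * Unmap succeeds only when the request exactly matches an existing
 * mapping; a sub-range of a larger mapping is rejected with -1.
 */
static int toy_unmap(struct toy_domain *d, uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < TOY_MAX_MAPPINGS; i++) {
        struct toy_mapping *m = &d->maps[i];
        if (m->used && m->iova == iova && m->size == size) {
            m->used = false;
            return 0;
        }
    }
    return -1;
}

/*
 * The PAGE_SIZE-only alternative: map a region one page at a time.
 * Returns the number of table entries consumed, or -1 on failure.
 */
static int toy_map_per_page(struct toy_domain *d, uint64_t iova, uint64_t size)
{
    int entries = 0;
    for (uint64_t off = 0; off < size; off += TOY_PAGE_SIZE) {
        if (toy_map(d, iova + off, TOY_PAGE_SIZE) != 0)
            return -1;
        entries++;
    }
    return entries;
}
```

In this model a 2 MiB region mapped with one toy_map() call cannot give
up a single ballooned page, while the same region mapped through
toy_map_per_page() can, at the cost of 512 entries instead of one.  That
entry count is a stand-in for the IOTLB and bookkeeping overhead
mentioned in the last bullet.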

Maybe some of these will evolve over time; SVA efforts are working on
some of these interfaces.  But device assignment users have apparently
been getting along just fine without ballooning for many years.  With
physical devices, or even modern VFs, it seems hard to push density
beyond what we can handle with memory hotplug.  Perhaps as we get into
scalable IOV type approaches we can opt-in more mediated devices by
default.  It seems like we're just going around in circles here, though:
anything more than preventing QEMU from shooting itself in the foot is
a long-term goal touching multiple levels of the stack.  Thanks,
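
For reference, the short-term "prevent QEMU from shooting itself" piece
amounts to a counted inhibitor, which the v2 changelog says uses atomic
ops.  The following is a minimal sketch of that idea in C11, not the
code from the patches: the names balloon_inhibit, balloon_is_inhibited
and inhibit_count are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Number of active users (e.g. assigned devices) that forbid ballooning. */
static atomic_int inhibit_count;

/*
 * Take (state == true) or drop (state == false) one inhibit reference.
 * Every drop must be paired with an earlier take.
 */
static void balloon_inhibit(bool state)
{
    if (state) {
        atomic_fetch_add(&inhibit_count, 1);
    } else {
        int prev = atomic_fetch_sub(&inhibit_count, 1);
        assert(prev > 0);   /* unbalanced release would be a caller bug */
    }
}

/* Ballooning stays blocked for as long as any reference is held. */
static bool balloon_is_inhibited(void)
{
    return atomic_load(&inhibit_count) > 0;
}
```

A counter rather than a boolean lets several devices inhibit
independently: ballooning is only permitted again once the last holder
has released its reference.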

Alex


