On Wed, 18 Jul 2018 14:48:03 +0800
Peter Xu <peterx@xxxxxxxxxx> wrote:

> On Tue, Jul 17, 2018 at 04:47:31PM -0600, Alex Williamson wrote:
> > Directly assigned vfio devices have never been compatible with
> > ballooning. Zapping MADV_DONTNEED pages happens completely
> > independent of vfio page pinning and IOMMU mapping, leaving us with
> > inconsistent GPA to HPA mapping between vCPUs and assigned devices
> > when the balloon deflates. Mediated devices can theoretically do
> > better, if we make the assumption that the mdev vendor driver is fully
> > synchronized to the actual working set of the guest driver. In that
> > case the guest balloon driver should never be able to allocate an mdev
> > pinned page for balloon inflation. Unfortunately, QEMU can't know the
> > workings of the vendor driver pinning, and doesn't actually know the
> > difference between mdev devices and directly assigned devices. Until
> > we can sort out how the vfio IOMMU backend can tell us if ballooning
> > is safe, the best approach is to disabling ballooning any time a vfio
> > devices is attached.
> >
> > To do that, simply make the balloon inhibitor a counter rather than a
> > boolean, fixup a case where KVM can then simply use the inhibit
> > interface, and inhibit ballooning any time a vfio device is attached.
> > I'm expecting we'll expose some sort of flag similar to
> > KVM_CAP_SYNC_MMU from the vfio IOMMU for cases where we can resolve
> > this. An addition we could consider here would be yet another device
> > option for vfio, such as x-disable-balloon-inhibit, in case there are
> > mdev devices that behave in a manner compatible with ballooning.
> >
> > Please let me know if this looks like a good idea. Thanks,
>
> IMHO patches 1-2 are good cleanup as standalone patches...
>
> I totally have no idea on whether people would like to use vfio-pci
> and the balloon device at the same time.
> After all vfio-pci are
> majorly for performance players, then I would vaguely guess that they
> don't really care thin provisioning of memory at all, hence the usage
> scenario might not exist much. Is that the major reason that we'd
> just like to disable it (which makes sense to me)?

Well, the major reason for disabling it is that it currently doesn't
work: when the balloon is deflated, the device and vCPU end up talking
to different host pages for the same GPA for previously ballooned
pages. Regardless of the amenability of device assignment to various
usage scenarios, that's a bad thing. I guess most device assignment
users have either realized this doesn't work and avoid it, or perhaps
they have VMs tuned more for performance than density and (again)
don't use ballooning.

> I'm wondering what if want to do that somehow some day... Whether
> it'll work if we just let vfio-pci devices to register some guest
> memory invalidation hook (just like the iommu notifiers, but for guest
> memory address space instead), then we map/unmap the IOMMU pages there
> for vfio-pci device to make sure the inflated balloon pages are not
> mapped and also make sure new pages are remapped with correct HPA
> after deflated. This is a pure question out of my curiosity, and for
> sure it makes little sense if the answer of the first question above
> is positive.

This is why I mention the KVM MMU synchronization flag above. KVM
essentially had this same problem and fixed it with MMU notifiers in
the kernel. They expose that KVM has the capability of handling such
a scenario via a feature flag. We can do the same with vfio. In
scenarios where we're able to fix this, we could expose a flag on the
container indicating support for the same sort of thing.

There are a few complications to this support though.
First, ballooning works at page size granularity, but IOMMU mapping
can make use of arbitrary superpage sizes and the IOMMU API only
guarantees unmap granularity equal to the original mapping. Therefore
we cannot unmap individual pages unless we require that all mappings
through the IOMMU API are done with page granularity, precluding the
use of superpages by the IOMMU and thereby inflicting higher IOTLB
overhead. Unlike with a CPU, we can't invalidate the mappings and
fault them back in, or halt the processor to make the page table
updates appear atomic. The device is considered always running and
interfering with that would likely lead to functional issues.

Second, MMU notifiers seem to provide invalidation, pte change
notices, and page aging interfaces, so if a page is consumed by the
balloon inflating, we can invalidate it (modulo the issues in the
previous paragraph), but how do we re-populate the mapping through
the IOMMU when the page is released as the balloon is deflated? KVM
seems to do this by handling the page fault, but we don't really have
that option for devices. If we try to solve this only for mdev
devices, we can request invalidation down to the vendor driver with
page granularity, and we could assume a vendor driver that's well
synchronized with the working set of the device would re-request a
page if it was previously invalidated and becomes part of the working
set. But if we have that assumption, then we could also assume that
such a vendor driver would never have a ballooning victim page in its
working set and therefore we don't need to do anything.
Unfortunately, without an audit, we can't really know the behavior of
the vendor driver. vfio-ccw might be an exception here since this
entire class of devices doesn't really perform DMA and page pinning
is done on a per transaction basis, aiui.

The vIOMMU is yet another consideration as it can effectively define
the working set for a device via the device AddressSpace.
If a ballooned request does not fall within the AddressSpace of any
assigned device, it would be safe to balloon the page. So long as
we're not running in IOMMU passthrough mode, these should be
distinctly separate sets; active DMA pages should not be ballooning
targets. However, I believe the current state of vIOMMU with
assigned devices is that it's functional, but not in any way
performant for this scenario. We see massive performance degradation
when trying to use vIOMMU for anything other than mostly static
mappings, such as when using passthrough mode or using userspace
drivers or nested guests with relatively static mappings. So I don't
know that it's a worthwhile return on investment if we were to test
whether a balloon victim page falls within a device's AddressSpace as
a further level of granularity. Thanks,

Alex
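
For reference, the "counter rather than a boolean" inhibitor described in
the quoted cover letter could look roughly like the sketch below. This is
a minimal standalone illustration, not the actual patch: the names
balloon_inhibit() and balloon_is_inhibited() are assumptions made for the
example, and it uses plain C11 atomics rather than any QEMU helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Number of active inhibitors; ballooning is allowed only while this is 0. */
static atomic_int balloon_inhibit_count;

/* Hypothetical helper: true while at least one user inhibits ballooning. */
bool balloon_is_inhibited(void)
{
    return atomic_load(&balloon_inhibit_count) > 0;
}

/*
 * Hypothetical helper: state == true takes a reference (for example when a
 * vfio device is attached), state == false drops one (device removed).
 */
void balloon_inhibit(bool state)
{
    int old = state ? atomic_fetch_add(&balloon_inhibit_count, 1)
                    : atomic_fetch_sub(&balloon_inhibit_count, 1);

    /* Dropping a reference that was never taken is a caller bug. */
    assert(state || old > 0);
}
```

With two devices attached, removing one leaves ballooning inhibited; a
single boolean would have re-enabled it at that point, which is the case
the counter is meant to cover.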