On Tue, Aug 07, 2018 at 10:35:05AM -0600, Alex Williamson wrote: > On Tue, 7 Aug 2018 21:10:21 +0800 > Peter Xu <peterx@xxxxxxxxxx> wrote: > > > On Mon, Jul 30, 2018 at 05:14:21PM -0600, Alex Williamson wrote: > > > We use a VFIOContainer to associate an AddressSpace to one or more > > > VFIOGroups. The VFIOContainer represents the DMA context for that > > > AdressSpace for those VFIOGroups and is synchronized to changes in > > > that AddressSpace via a MemoryListener. For IOMMU backed devices, > > > maintaining the DMA context for a VFIOGroup generally involves > > > pinning a host virtual address in order to create a stable host > > > physical address and then mapping a translation from the associated > > > guest physical address to that host physical address into the IOMMU. > > > > > > While the above maintains the VFIOContainer synchronized to the QEMU > > > memory API of the VM, memory ballooning occurs outside of that API. > > > Inflating the memory balloon (ie. cooperatively capturing pages from > > > the guest for use by the host) simply uses MADV_DONTNEED to "zap" > > > pages from QEMU's host virtual address space. The page pinning and > > > IOMMU mapping above remains in place, negating the host's ability to > > > reuse the page, but the host virtual to host physical mapping of the > > > page is invalidated outside of QEMU's memory API. > > > > > > When the balloon is later deflated, attempting to cooperatively > > > return pages to the guest, the page is simply freed by the guest > > > balloon driver, allowing it to be used in the guest and incurring a > > > page fault when that occurs. The page fault maps a new host physical > > > page backing the existing host virtual address, meanwhile the > > > VFIOContainer still maintains the translation to the original host > > > physical address. At this point the guest vCPU and any assigned > > > devices will map different host physical addresses to the same guest > > > physical address. Badness. > > > > > > The IOMMU typically does not have page level granularity with which > > > it can track this mapping without also incurring inefficiencies in > > > using page size mappings throughout. MMU notifiers in the host > > > kernel also provide indicators for invalidating the mapping on > > > balloon inflation, not for updating the mapping when the balloon is > > > deflated. For these reasons we assume a default behavior that the > > > mapping of each VFIOGroup into the VFIOContainer is incompatible > > > with memory ballooning and increment the balloon inhibitor to match > > > the attached VFIOGroups. > > > > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > > --- > > > hw/vfio/common.c | 5 +++++ > > > 1 file changed, 5 insertions(+) > > > > > > diff --git a/hw/vfio/common.c b/hw/vfio/common.c > > > index fb396cf00ac4..4881b691a659 100644 > > > --- a/hw/vfio/common.c > > > +++ b/hw/vfio/common.c > > > @@ -32,6 +32,7 @@ > > > #include "hw/hw.h" > > > #include "qemu/error-report.h" > > > #include "qemu/range.h" > > > +#include "sysemu/balloon.h" > > > #include "sysemu/kvm.h" > > > #include "trace.h" > > > #include "qapi/error.h" > > > @@ -1049,6 +1050,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, > > > group->container = container; > > > QLIST_INSERT_HEAD(&container->group_list, group, container_next); > > > vfio_kvm_device_add_group(group); > > > + qemu_balloon_inhibit(true); > > > > [1] > > > > > return 0; > > > } > > > } > > > @@ -1198,6 +1200,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as, > > > } > > > > > > vfio_kvm_device_add_group(group); > > > + qemu_balloon_inhibit(true); > > > > AFAIU there is a very critical information that this > > qemu_balloon_inhibit() call must be before the call to: > > > > memory_listener_register(&container->listener, container->space->as); > > > > Since the memory listener registeration is the point when we do the > > pinning of the pages. So to make sure we won't have stale pages we > > must call qemu_balloon_inhibit() before memory_listener_register() > > (which is what this patch does). However this is not that obvious, > > not sure whether that might worth a comment. > > > > Considering this, not sure whether we can just do this per-container > > instead of per-group, then we also don't need to bother with extra > > group-add paths like [1]. > > > > No matter what, this patch looks good to me (and it is correct AFAIK), > > so I'm leaving r-b and I'll leave Alex to decide: > > Thanks Peter. I agree, I'll add more commentary. A minor correction, > we won't have "stale" pages at the time we pin, the act of pinning will > make those valid (same as I discussed in reply to mst why we don't need > to worry about pages ballooned before the device is added), but once a > page is pinned, we need to make sure it's not madvised dontneed, so we > need to be sure there's no possible race there, which effectively means > inhibiting before the memory listener can do any pinning. Yes. > > The reason I chose to inhibit per group is that it becomes easier to > allow endpoint drivers to opt-in. For instance if we could have ccw > and vfio-pci in the same VM, they would by default share a container. > If ccw releases the inhibit, we'd need to somehow reinstate it for the > vfio-pci device and remember which did what if one is hot unplugged. > Doing the inhibit at the group level resolves this, the ccw group adds > an inhibit by default, then releases it, the vfio-pci group adds an > inhibit and maintains it so long as attached. I struggled with whether > this should actually be a per-device inhibit, but then there's a gap > that the container listener is active before the device is retrieved, > so again the per-group inhibit was a better fit. Thanks, Thanks for explaining. I didn't look into the ccw patch, but it sounds reasonable to me now. Regards, -- Peter Xu