On Thu, 15 Aug 2019 12:25:31 -0700
Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:

> On Thu, Aug 15, 2019 at 12:16:07PM -0600, Alex Williamson wrote:
> > On Thu, 15 Aug 2019 09:00:06 -0700
> > Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
> >
> > If I print out the memslot base_gfn, it seems pretty evident that
> > only the assigned device mappings are triggering this branch.  The
> > base_gfns exclusively include:
> >
> >   0x800000
> >   0x808000
> >   0xc0089
> >
> > Where the first two clearly match the 64bit BARs and the last is the
> > result of a page that we need to emulate within the BAR @0xc0000000
> > at offset 0x88000, so the base_gfn is the remaining direct mapping.
>
> That's consistent with my understanding of userspace, e.g. normal
> memory regions aren't deleted until the VM is shut down (barring hot
> unplug).
>
> > I don't know if this implies we're doing something wrong for
> > assigned device slots, but maybe a more targeted workaround would
> > be possible if we could specifically identify these slots, though
> > there's no special registration of them versus other slots.
>
> What is triggering the memslot removal/update?  Is it possible that
> whatever action is occurring is modifying multiple memslots?  E.g.
> KVM's memslot-only zapping is allowing the guest to access stale
> entries for the unzapped-but-related memslots, whereas the full zap
> does not.
>
> FYI, my VFIO/GPU/PCI knowledge is abysmal, please speak up if any of
> my ideas are nonsensical.

The memory bit in the PCI command register of config space for each
device controls whether the device's decoders are active for the MMIO
BAR ranges.  These bits get flipped as both the guest firmware and the
guest OS enumerate and assign resources to the PCI subsystem.
Generally they are not manipulated further while the guest OS is
running, except for hotplug operations.  The guest OS device driver
will generally perform the final enable of these ranges and they'll
remain enabled until the guest is rebooted.

I recall somewhere in this thread you referenced reading the ROM as
part of the performance testing of this series.  The ROM has its own
enable bit within the ROM BAR, as the PCI spec allows devices to share
decoders between the standard BARs and the ROM BAR.  Toggling the
enable bit in the ROM BAR should therefore look very similar, in terms
of memslot behavior, to toggling the overall memory enable bit that
covers the other BARs of the device.  (A rough sketch of both toggles
as config space accesses is appended below my sign-off.)

Note that the faults I'm seeing often occur long after the BAR
mappings are finalized; usually (though not always) the VM boots to a
fully functional desktop and it's only as I run various tests that the
glitches start to appear.  For instance, when I allowed sp->gfn
0xfec00 to take the continue branch, I got an OpenCL error.  For
either 0xffee00 or 0xc1000 I got graphics glitches, for example stray
geometric artifacts flashed on the screen.  For 0x100000 and 0x800000
I'd get a black screen or a blank 3D graphics window.  For 0x80a000
the VM froze (apparently).  I can't say whether each of these is a
consistent failure mode; I only tested to the point of determining
whether a range generates an error.

> > Did you have any non-device assignment test cases that took this
> > branch when developing the series?
>
> The primary testing was performance oriented, using a slightly
> modified version of a synthetic benchmark[1] from a previous
> series[2] that touched the memslot flushing flow.  From a functional
> perspective, I highly doubt that test would have been able to expose
> an improper zapping bug.
:-\  It seems like there's probably some sort of inflection point where
it becomes faster to zap all pages versus paying the overhead of
walking every page in a memory slot; was that evaluated?  Not sure if
that's relevant here, but curious.  Thanks,

Alex
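---

For anyone following along, here's a rough userspace-style sketch of
the two config space toggles described above.  This is illustrative
only, not code from QEMU or from this series; it assumes config_offset
has already been discovered via VFIO_DEVICE_GET_REGION_INFO for the
PCI config region of a VFIO device fd, and it omits all error handling.

#include <stdint.h>
#include <unistd.h>
#include <linux/pci_regs.h>

/* Flip the memory space enable bit that gates MMIO decode for all BARs. */
static void pci_set_mem_enable(int device_fd, off_t config_offset, int enable)
{
	uint16_t cmd;

	pread(device_fd, &cmd, sizeof(cmd), config_offset + PCI_COMMAND);
	if (enable)
		cmd |= PCI_COMMAND_MEMORY;
	else
		cmd &= ~PCI_COMMAND_MEMORY;
	pwrite(device_fd, &cmd, sizeof(cmd), config_offset + PCI_COMMAND);
}

/* Flip the separate enable bit carried in the expansion ROM BAR itself. */
static void pci_set_rom_enable(int device_fd, off_t config_offset, int enable)
{
	uint32_t rom;

	pread(device_fd, &rom, sizeof(rom), config_offset + PCI_ROM_ADDRESS);
	if (enable)
		rom |= PCI_ROM_ADDRESS_ENABLE;
	else
		rom &= ~PCI_ROM_ADDRESS_ENABLE;
	pwrite(device_fd, &rom, sizeof(rom), config_offset + PCI_ROM_ADDRESS);
}

When the guest performs the equivalent config writes, userspace traps
them and updates its memory regions accordingly, which is presumably
what produces the memslot add/remove operations whose zapping behavior
is being discussed in this thread.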