On Tue, Aug 13, 2019 at 01:33:16PM -0600, Alex Williamson wrote:
> On Tue, 13 Aug 2019 11:57:37 -0600
> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote:
>
> Could it be something with the gfn test:
>
> 			if (sp->gfn != gfn)
> 				continue;
>
> If I remove it, I can't trigger the misbehavior.  If I log it, I only
> get hits on VM boot/reboot and some of the gfns look suspiciously like
> they could be the assigned GPU BARs and maybe MSI mappings:
>
>    (sp->gfn) != (gfn)

Hits at boot/reboot make sense: memslots get zapped when userspace removes a
memory region/slot, e.g. when it remaps BARs and whatnot.

...

> Is this gfn optimization correct?  Overzealous?  Doesn't account
> correctly for something about MMIO mappings?  Thanks,

Yes?  Shadow pages are stored in a hash table, and for_each_valid_sp() walks
all entries in the hash list that a given gfn maps to.  The sp->gfn check is
there to skip entries that hashed to the same list but belong to a completely
different gfn.

Skipping the gfn check would be sort of a lightweight zap-all, in the sense
that it would zap shadow pages that happened to collide with the target
memslot/gfn but are otherwise unrelated.

What happens if you give just the GPU BAR at 0x80000000 a pass, i.e.:

			if (sp->gfn != gfn && sp->gfn != 0x80000)
				continue;

If that doesn't work, it might be worth trying other gfns to see if you can
pinpoint which sp is being zapped as collateral damage.

It's possible there is a pre-existing bug somewhere else that was being
hidden because KVM was effectively zapping all SPTEs during (re)boot, and the
hash collision is also hiding the bug by zapping the stale entry.

Of course it's also possible this code is wrong, :-)
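
To make the collision argument concrete, here is a minimal userspace sketch of
the idea.  It is not KVM code: toy_hash(), insert_sp(), zap_gfn(), the
16-bucket table and the struct layout are all made up for illustration, and
the real hash function and list handling in the kernel differ.  It only shows
how dropping the sp->gfn check turns a targeted zap into a zap of everything
that shares the bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption: a deliberately tiny table so that collisions are easy to hit. */
#define NUM_BUCKETS 16u

/* Hypothetical stand-in for a shadow page; not the kernel's struct. */
struct shadow_page {
	gfn_t gfn;
	int valid;
	struct shadow_page *next;	/* chain within one hash bucket */
};

static struct shadow_page *buckets[NUM_BUCKETS];

/* Toy hash: the low bits of the gfn select the bucket. */
static unsigned int toy_hash(gfn_t gfn)
{
	return (unsigned int)(gfn & (NUM_BUCKETS - 1));
}

/* Mark sp valid for gfn and push it onto the front of gfn's bucket. */
static void insert_sp(struct shadow_page *sp, gfn_t gfn)
{
	unsigned int b = toy_hash(gfn);

	assert(sp != NULL);
	sp->gfn = gfn;
	sp->valid = 1;
	sp->next = buckets[b];
	buckets[b] = sp;
}

/*
 * Walk the bucket that gfn hashes to and invalidate ("zap") valid entries.
 * With check_gfn set, entries for a different gfn that merely share the
 * bucket are skipped.  With check_gfn clear, they are zapped as well.
 * Returns the number of entries zapped.
 */
static int zap_gfn(gfn_t gfn, int check_gfn)
{
	struct shadow_page *sp;
	int zapped = 0;

	for (sp = buckets[toy_hash(gfn)]; sp; sp = sp->next) {
		if (!sp->valid)
			continue;
		if (check_gfn && sp->gfn != gfn)
			continue;
		sp->valid = 0;
		zapped++;
	}
	return zapped;
}
```

In this toy table, gfns 0x80000 and 0x10 land in the same bucket.  Calling
zap_gfn(0x80000, 1) leaves the 0x10 entry alone, while zap_gfn(0x80000, 0)
takes it out as collateral damage, which is the "lightweight zap-all" effect
described above.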