On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote: > > On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote: > > On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote: > > > > > > On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson > > > <sean.j.christopherson@xxxxxxxxx> wrote: > > > > > > > > On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote: > > > > > On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@xxxxxxxxxx> wrote: > > > > > > > The scarier code (for me) is transparent_hugepage_adjust() and > > > > > > > kvm_mmu_zap_collapsible_spte(), as I don't at all understand the > > > > > > > interaction between THP and _PAGE_DEVMAP. > > > > > > > > > > > > The x86 KVM MMU code is one of the ugliest code I know (sorry, but it > > > > > > had to be said :/ ). Luckily, this should be independent of the > > > > > > PG_reserved thingy AFAIKs. > > > > > > > > > > Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte() > > > > > are honoring kvm_is_reserved_pfn(), so again I'm missing where the > > > > > page count gets mismanaged and leads to the reported hang. > > > > > > > > When mapping pages into the guest, KVM gets the page via gup(), which > > > > increments the page count for ZONE_DEVICE pages. But KVM puts the page > > > > using kvm_release_pfn_clean(), which skips put_page() if PageReserved() > > > > and so never puts its reference to ZONE_DEVICE pages. > > > > > > Oh, yeah, that's busted. > > > > Ugh, it's extra busted because every other gup user in the kernel > > tracks the pages resulting from gup and puts them (put_page()) when > > they are done. KVM wants to forget about whether it did a gup to get > > the page and optionally trigger put_page() based purely on the pfn. > > Outside of VFIO device assignment that needs pages pinned for DMA, why > > does KVM itself need to pin pages? If pages are pinned over a return > > to userspace that needs to be a FOLL_LONGTERM gup. > > Short answer, KVM pins the page to ensure correctness with respect to the > primary MMU invalidating the associated host virtual address, e.g. when > the page is being migrated or unmapped from host userspace. > > The main use of gup() is to handle guest page faults and map pages into > the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the > PFN and to temporarily pin the page. The pin is held just long enough to > guaranteed that any invalidation via the mmu_notifier will be stalled > until after KVM finishes installing the page into the secondary MMU, i.e. > the pin is short-term and not held across a return to userspace or entry > into the guest. When a subsequent mmu_notifier invalidation occurs, KVM > pulls the PFN from the secondary MMU and uses that to update accessed > and dirty bits in the host. > > There are a few other KVM flows that eventually call into gup(), but those > are "traditional" short-term pins and use put_page() directly. Ok, I was misinterpreting the effect of the bug with what KVM is using the reference to do. To your other point: > But David's proposed fix for the above refcount bug is to omit the patch > so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems > like the right thing to do, including for thp_adjust(), e.g. it would > naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is > mapped with a huge page (2mb or above) in the host. The only hiccup is > figuring out how to correctly transfer the reference. That might not be the only hiccup. There's currently no such thing as huge pages for ZONE_DEVICE, there are huge *mappings* (pmd and pud), but the result of pfn_to_page() on such a mapping does not yield a huge 'struct page'. It seems there are other paths in KVM that assume that more typical page machinery is active like SetPageDirty() based on kvm_is_reserved_pfn(). While I told David that I did not want to see more usage of is_zone_device_page(), this patch below (untested) seems a cleaner path with less surprises: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4df0aa6b8e5c..fbea17c1810c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean); void kvm_release_pfn_clean(kvm_pfn_t pfn) { - if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) + if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) || + (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn)))) put_page(pfn_to_page(pfn)); } EXPORT_SYMBOL_GPL(kvm_release_pfn_clean); This is safe because the reference that KVM took earlier protects the is_zone_device_page() lookup from racing device teardown. Otherwise, if KVM does not have a reference it's unsafe, but that's already even more broken because KVM would be releasing a page that it never referenced. Every other KVM operation that assumes page allocator pages would continue to honor kvm_is_reserved_pfn(). _______________________________________________ devel mailing list devel@xxxxxxxxxxxxxxxxxxxxxx http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel