On 06.11.19 01:08, Dan Williams wrote:
On Tue, Nov 5, 2019 at 4:03 PM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
On Tue, Nov 05, 2019 at 03:43:29PM -0800, Dan Williams wrote:
On Tue, Nov 5, 2019 at 3:30 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
On Tue, Nov 5, 2019 at 3:13 PM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
On Tue, Nov 05, 2019 at 03:02:40PM -0800, Dan Williams wrote:
On Tue, Nov 5, 2019 at 12:31 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
The scarier code (for me) is transparent_hugepage_adjust() and
kvm_mmu_zap_collapsible_spte(), as I don't at all understand the
interaction between THP and _PAGE_DEVMAP.
The x86 KVM MMU code is some of the ugliest code I know (sorry, but it
had to be said :/ ). Luckily, this should be independent of the
PG_reserved thingy AFAIK.
Both transparent_hugepage_adjust() and kvm_mmu_zap_collapsible_spte()
are honoring kvm_is_reserved_pfn(), so again I'm missing where the
page count gets mismanaged and leads to the reported hang.
When mapping pages into the guest, KVM gets the page via gup(), which
increments the page count for ZONE_DEVICE pages. But KVM puts the page
using kvm_release_pfn_clean(), which skips put_page() if PageReserved()
and so never puts its reference to ZONE_DEVICE pages.
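To make the imbalance concrete, here is a minimal userspace model of the
get/put mismatch described above. It is only a sketch, not kernel code:
struct model_page, model_gup(), model_kvm_release() and model_leaked_refs()
are hypothetical names invented for illustration, and the starting refcount
of 1 is an arbitrary assumption.

```c
#include <stdbool.h>

/* Toy stand-in for struct page (hypothetical, not the kernel's layout). */
struct model_page {
	int refcount;
	bool reserved;	/* models PageReserved(), set for ZONE_DEVICE here */
};

/* Models gup(): takes a reference whether or not the page is reserved. */
static void model_gup(struct model_page *page)
{
	page->refcount++;
}

/* Models kvm_release_pfn_clean() as described: no put for reserved pages. */
static void model_kvm_release(struct model_page *page)
{
	if (!page->reserved)
		page->refcount--;
}

/* Runs one get/release cycle and returns how many references are leaked. */
static int model_leaked_refs(bool reserved)
{
	struct model_page page = { .refcount = 1, .reserved = reserved };
	int before = page.refcount;

	model_gup(&page);
	model_kvm_release(&page);
	return page.refcount - before;
}
```

In this model a normal page ends the cycle balanced, while a reserved
(ZONE_DEVICE) page keeps one extra reference per cycle.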
Oh, yeah, that's busted.
Ugh, it's extra busted because every other gup user in the kernel
tracks the pages resulting from gup and puts them (put_page()) when
they are done. KVM wants to forget about whether it did a gup to get
the page and optionally trigger put_page() based purely on the pfn.
Outside of VFIO device assignment that needs pages pinned for DMA, why
does KVM itself need to pin pages? If pages are pinned over a return
to userspace that needs to be a FOLL_LONGTERM gup.
Short answer, KVM pins the page to ensure correctness with respect to the
primary MMU invalidating the associated host virtual address, e.g. when
the page is being migrated or unmapped from host userspace.
The main use of gup() is to handle guest page faults and map pages into
the guest, i.e. into KVM's secondary MMU. KVM uses gup() to both get the
PFN and to temporarily pin the page. The pin is held just long enough to
guarantee that any invalidation via the mmu_notifier will be stalled
until after KVM finishes installing the page into the secondary MMU, i.e.
the pin is short-term and not held across a return to userspace or entry
into the guest. When a subsequent mmu_notifier invalidation occurs, KVM
pulls the PFN from the secondary MMU and uses that to update accessed
and dirty bits in the host.
There are a few other KVM flows that eventually call into gup(), but those
are "traditional" short-term pins and use put_page() directly.
Ok, I was conflating the effect of the bug with what KVM is using
the reference to do.
To your other point:
But David's proposed fix for the above refcount bug is to omit the patch
so that KVM no longer treats ZONE_DEVICE pages as reserved. That seems
like the right thing to do, including for thp_adjust(), e.g. it would
naturally let KVM use 2mb pages for the guest when a ZONE_DEVICE page is
mapped with a huge page (2mb or above) in the host. The only hiccup is
figuring out how to correctly transfer the reference.
That might not be the only hiccup. There's currently no such thing as
huge pages for ZONE_DEVICE; there are huge *mappings* (pmd and pud),
but the result of pfn_to_page() on such a mapping does not yield a
huge 'struct page'. It seems there are other paths in KVM that assume
that more typical page machinery is active, like SetPageDirty(), based
on kvm_is_reserved_pfn(). While I told David that I did not want to
see more usage of is_zone_device_page(), this patch below (untested)
seems a cleaner path with fewer surprises:
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4df0aa6b8e5c..fbea17c1810c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1831,7 +1831,8 @@ EXPORT_SYMBOL_GPL(kvm_release_page_clean);
 void kvm_release_pfn_clean(kvm_pfn_t pfn)
 {
-	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
+	if ((!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn)) ||
+	    (pfn_valid(pfn) && is_zone_device_page(pfn_to_page(pfn))))
 		put_page(pfn_to_page(pfn));
 }
 EXPORT_SYMBOL_GPL(kvm_release_pfn_clean);
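For readers who want to see which cases the condition changes, the old and
new predicates can be written as pure functions over a few booleans. This
is a rough userspace sketch under stated assumptions: struct model_pfn and
both helper names are hypothetical, and each field merely stands in for the
result of the kernel helper named in its comment.

```c
#include <stdbool.h>

/* Hypothetical summary of what the kernel helpers would report for a pfn. */
struct model_pfn {
	bool error_noslot;	/* models is_error_noslot_pfn(pfn) */
	bool reserved;		/* models kvm_is_reserved_pfn(pfn) */
	bool valid;		/* models pfn_valid(pfn) */
	bool zone_device;	/* models is_zone_device_page(pfn_to_page(pfn)) */
};

/* Condition before the patch: put only non-error, non-reserved pfns. */
static bool old_should_put(const struct model_pfn *p)
{
	return !p->error_noslot && !p->reserved;
}

/* Condition after the patch: additionally put valid ZONE_DEVICE pages. */
static bool new_should_put(const struct model_pfn *p)
{
	return (!p->error_noslot && !p->reserved) ||
	       (p->valid && p->zone_device);
}
```

In this model the only input whose outcome changes is a valid pfn that is
both reserved and ZONE_DEVICE; normal pages, other reserved pages, and
error pfns behave as before.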
I had the same thought, but I do wonder about the kvm_get_pfn() users,
e.g.:
hva_to_pfn_remapped():
	r = follow_pfn(vma, addr, &pfn);
	...
	kvm_get_pfn(pfn);
	...
We would not take a reference for ZONE_DEVICE, but later drop one
reference via kvm_release_pfn_clean(). IOW, kvm_get_pfn() gets *really*
dangerous to use. I can't tell if this can happen right now.
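The concern can be illustrated with a small userspace model. It is a
sketch only: the struct and function names are hypothetical, and it
assumes that kvm_get_pfn() skips the get for reserved pfns (which is what
"would not take a reference for ZONE_DEVICE" implies) while the release
side follows the patched condition above.

```c
#include <stdbool.h>

/* Toy stand-in for struct page (hypothetical, not the kernel's layout). */
struct model_page {
	int refcount;
	bool reserved;		/* models kvm_is_reserved_pfn() */
	bool zone_device;	/* models is_zone_device_page() */
};

/* Models kvm_get_pfn(): assumed to skip the get for reserved pfns. */
static void model_kvm_get_pfn(struct model_page *page)
{
	if (!page->reserved)
		page->refcount++;
}

/* Models the patched kvm_release_pfn_clean(): also puts ZONE_DEVICE. */
static void model_patched_release(struct model_page *page)
{
	if (!page->reserved || page->zone_device)
		page->refcount--;
}

/* Net refcount change after one kvm_get_pfn() + release cycle. */
static int model_net_refs(bool reserved, bool zone_device)
{
	struct model_page page = {
		.refcount = 1,
		.reserved = reserved,
		.zone_device = zone_device,
	};
	int before = page.refcount;

	model_kvm_get_pfn(&page);
	model_patched_release(&page);
	return page.refcount - before;
}
```

Under these assumptions a normal page stays balanced, whereas a reserved
ZONE_DEVICE page loses a reference it never took, i.e. the count
underflows by one per cycle.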
We do have 3 users of kvm_get_pfn() that we have to audit before this
change. Also, we should add a comment to kvm_get_pfn() that it should
never be used with possible ZONE_DEVICE pages.
Also, we should add a comment to kvm_release_pfn_clean(), describing why
we treat ZONE_DEVICE in a special way here.
We can then progress like this:
1. Get this fix upstream; it's somewhat unrelated to this series.
2. This patch here remains as is in this series (+/- documentation update).
3. Long term, rework KVM so that it no longer has to treat ZONE_DEVICE like
reserved pages, e.g., get rid of kvm_get_pfn(). Then, this special zone
check can go.
--
Thanks,
David / dhildenb