On Mon, Oct 29, 2018 at 2:07 PM Barret Rhoden <brho@xxxxxxxxxx> wrote:
>
> This change allows KVM to map DAX-backed files made of huge pages with
> huge mappings in the EPT/TDP.
>
> DAX pages are not PageTransCompound. The existing check is trying to
> determine if the mapping for the pfn is a huge mapping or not. For
> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>
> For DAX, we can check the page table itself. Actually, we might always
> be able to walk the page table, even for PageTransCompound pages, but
> it's probably a little slower.
>
> Note that KVM already faulted in the page (or huge page) in the host's
> page table, and we hold the KVM mmu spinlock (grabbed before checking
> the mmu seq). Based on the other comments about not worrying about a
> pmd split, we might be able to safely walk the page table without
> holding the mm sem.
>
> This patch relies on kvm_is_reserved_pfn() being false for DAX pages,
> which I've hacked up for testing this code. That change should
> eventually happen:
>
> https://lore.kernel.org/lkml/20181022084659.GA84523@tiger-server/
>
> Another issue is that kvm_mmu_zap_collapsible_spte() also uses
> PageTransCompoundMap() to detect huge pages, but we don't have a way to
> get the HVA easily. Can we just aggressively zap DAX pages there?
>
> Alternatively, is there a better way to track at the struct page level
> whether or not a page is huge-mapped? Maybe the DAX huge pages mark
> themselves as TransCompound or something similar, and we don't need to
> special case DAX/ZONE_DEVICE pages.
>
> Signed-off-by: Barret Rhoden <brho@xxxxxxxxxx>
> ---
>  arch/x86/kvm/mmu.c | 71 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 70 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index cf5f572f2305..9f3e0f83a2dd 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3152,6 +3152,75 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
>  	return -EFAULT;
>  }
>
> +static unsigned long pgd_mapping_size(struct mm_struct *mm, unsigned long addr)
> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +
> +	pgd = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgd))
> +		return 0;
> +
> +	p4d = p4d_offset(pgd, addr);
> +	if (!p4d_present(*p4d))
> +		return 0;
> +	if (p4d_huge(*p4d))
> +		return P4D_SIZE;
> +
> +	pud = pud_offset(p4d, addr);
> +	if (!pud_present(*pud))
> +		return 0;
> +	if (pud_huge(*pud))
> +		return PUD_SIZE;
> +
> +	pmd = pmd_offset(pud, addr);
> +	if (!pmd_present(*pmd))
> +		return 0;
> +	if (pmd_huge(*pmd))
> +		return PMD_SIZE;
> +
> +	pte = pte_offset_map(pmd, addr);
> +	if (!pte_present(*pte))
> +		return 0;
> +	return PAGE_SIZE;
> +}
> +
> +static bool pfn_is_pmd_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
> +{
> +	struct page *page = pfn_to_page(pfn);
> +	unsigned long hva, map_sz;
> +
> +	if (!is_zone_device_page(page))
> +		return PageTransCompoundMap(page);
> +
> +	/*
> +	 * DAX pages do not use compound pages. The page should have already
> +	 * been mapped into the host-side page table during try_async_pf(), so
> +	 * we can check the page tables directly.
> +	 */
> +	hva = gfn_to_hva(kvm, gfn);
> +	if (kvm_is_error_hva(hva))
> +		return false;
> +
> +	/*
> +	 * Our caller grabbed the KVM mmu_lock with a successful
> +	 * mmu_notifier_retry, so we're safe to walk the page table.
> +	 */
> +	map_sz = pgd_mapping_size(current->mm, hva);
> +	switch (map_sz) {
> +	case PMD_SIZE:
> +		return true;
> +	case P4D_SIZE:
> +	case PUD_SIZE:
> +		printk_once(KERN_INFO "KVM THP promo found a very large page");

Why not allow PUD_SIZE? The device-dax interface supports PUD mappings.

> +		return false;
> +	}
> +	return false;
> +}

The above 2 functions are similar to what we need to do for determining
the blast radius of a memory error, see dev_pagemap_mapping_shift() and
its usage in add_to_kill().

> +
>  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  					gfn_t *gfnp, kvm_pfn_t *pfnp,
>  					int *levelp)
> @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>  	 */
>  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
>  	    level == PT_PAGE_TABLE_LEVEL &&
> -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> +	    pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&

I'm wondering, since we're adding an explicit is_zone_device_page() check
in this path to determine the page mapping size, whether that can be a
replacement for the kvm_is_reserved_pfn() check. In other words, the goal
of fixing up PageReserved() was to preclude the need for DAX-page special
casing in KVM, but if we already need to add some special casing for page
size determination, we might as well bypass the kvm_is_reserved_pfn()
dependency as well.