On 2018-10-29 at 20:10 Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > > >  static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > >  					gfn_t *gfnp, kvm_pfn_t *pfnp,
> > > >  					int *levelp)
> > > > @@ -3168,7 +3237,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> > > >  	 */
> > > >  	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
> > > >  	    level == PT_PAGE_TABLE_LEVEL &&
> > > > -	    PageTransCompoundMap(pfn_to_page(pfn)) &&
> > > > +	    pfn_is_pmd_mapped(vcpu->kvm, gfn, pfn) &&
> > >
> > > I'm wondering if we're adding an explicit is_zone_device_page() check
> > > in this path to determine the page mapping size if that can be a
> > > replacement for the kvm_is_reserved_pfn() check. In other words, the
> > > goal of fixing up PageReserved() was to preclude the need for DAX-page
> > > special casing in KVM, but if we already need add some special casing
> > > for page size determination, might as well bypass the
> > > kvm_is_reserved_pfn() dependency as well.
> >
> > kvm_is_reserved_pfn() is used in some other places, like
> > kvm_set_pfn_dirty()and kvm_set_pfn_accessed(). Maybe the way those
> > treat DAX pages matters on a case-by-case basis?
> >
> > There are other callers of kvm_is_reserved_pfn() such as
> > kvm_pfn_to_page() and gfn_to_page(). I'm not familiar (yet) with how
> > struct pages and DAX work together, and whether or not the callers of
> > those pfn_to_page() functions have expectations about the 'type' of
> > struct page they get back.
> >
>
> The property of DAX pages that requires special coordination is the
> fact that the device hosting the pages can be disabled at will. The
> get_dev_pagemap() api is the interface to pin a device-pfn so that you
> can safely perform a pfn_to_page() operation.
>
> Have the pages that kvm uses in this path already been pinned by vfio?

I'm not aware of any explicit pinning, but it might be happening under
the hood.
These pages are just generic guest RAM, but they are present in a
host-side mapping.

I ran into this when looking at EPT fault handling. In the code I
changed, a physical page was faulted in to the task's page table, then
while the kvm->mmu_lock is held, KVM makes an EPT mapping to the same
physical page. That mmu_lock seems to prevent any concurrent host-side
unmappings; though I'm not familiar with the mm notifier stuff.

One usage of kvm_is_reserved_pfn() in KVM code is like this:

static struct page *kvm_pfn_to_page(kvm_pfn_t pfn)
{
	if (is_error_noslot_pfn(pfn))
		return KVM_ERR_PTR_BAD_PAGE;

	if (kvm_is_reserved_pfn(pfn)) {
		WARN_ON(1);
		return KVM_ERR_PTR_BAD_PAGE;
	}

	return pfn_to_page(pfn);
}

I think there's no guarantee the kvm->mmu_lock is held in the generic
case. Here's one case where it wasn't (from walking through the code):

handle_exception
-handle_ud
--kvm_emulate_instruction
---x86_emulate_instruction
----x86_emulate_insn
-----writeback
------segmented_cmpxchg
-------emulator_cmpxchg_emulated
--------kvm_vcpu_gfn_to_page
---------kvm_pfn_to_page

There are probably other rules related to gfn_to_page that keep the
page alive, maybe just during interrupt/vmexit context? Whatever keeps
those pages alive for normal memory might grab that devmap reference
under the hood for DAX mappings.

Thanks,

Barret