On Tue, Nov 26, 2019 at 04:21:09PM +0100, Joerg Roedel wrote: > Hi Paolo et al, > > while looking again at the recently added IFU patches I noticed a > dicrepancy between the two _hugepage_adjust() functions which doesn't > make sense to me yet: > > * transparent_hugepage_adjust(), when changing the value of pfn, > does a kvm_release_pfn_clean() on the old value and a > kvm_get_pfn() on the new value to make sure the code holds the > reference to the correct pfn. > > * disallowed_hugepage_adjust() also changes the value of the pfn > to map, kinda reverses what transparent_hugepage_adjust() did > before. But that function does not care about the pfn > refcounting. > > I was wondering what the reason for that might be, is it just not > necessary in disallowed_hugepage_adjust() or is that an oversight? The page fault flows don't actually rely on holding a reference to the page once they reach thp_adjust(). At that point, they hold mmu_lock and have verified no invalidation from mmu_notifier have occured since the page reference was acquired. The release/get pfn dance in transparent_hugepage_adjust() is a quirk of sorts that is necessitated because thp_adjust() modifies the local @pfn variable in FNAME(page_fault)(), nonpaging_map() and tdp_page_fault(). Those functions all call kvm_release_pfn_clean() on @pfn, so thp_adjust() needs to transfer the page reference purely for correctness when the pfn is released. disallowed_hugepage_adjust() is called from __direct_map() and its modification of the pfn is also contained to __direct_map(), i.e. the updated @pfn doesn't get propagated back up to the fault handlers. Thus, kvm_release_pfn_clean() is called on the original pfn and so there's no need to transfer the page reference. The above discrepancy can resolved by moving thp_adjust() into FNAME(fetch) and __direct_map() so that the "top-level" @pfn isn't modified. Getting rid of the kvm_get_pfn() call would be a nice side effect. The downside is that @force_pt_level would need to be passed down the call stack.