Re: THP refcounting in disallowed_hugepage_adjust()?

Sean Christopherson <sean.j.christopherson@xxxxxxxxx> · Tue, 26 Nov 2019 09:46:03 -0800

On Tue, Nov 26, 2019 at 04:21:09PM +0100, Joerg Roedel wrote:
> Hi Paolo et al,
> 
> while looking again at the recently added IFU patches I noticed a
> dicrepancy between the two _hugepage_adjust() functions which doesn't
> make sense to me yet:
> 
> 	* transparent_hugepage_adjust(), when changing the value of pfn,
> 	  does a kvm_release_pfn_clean() on the old value and a
> 	  kvm_get_pfn() on the new value to make sure the code holds the
> 	  reference to the correct pfn.
> 
> 	* disallowed_hugepage_adjust() also changes the value of the pfn
> 	  to map, kinda reverses what transparent_hugepage_adjust() did
> 	  before. But that function does not care about the pfn
> 	  refcounting.
> 
> I was wondering what the reason for that might be, is it just not
> necessary in disallowed_hugepage_adjust() or is that an oversight?

The page fault flows don't actually rely on holding a reference to the
page once they reach thp_adjust().  At that point, they hold mmu_lock and
have verified no invalidation from mmu_notifier have occured since the
page reference was acquired. 

The release/get pfn dance in transparent_hugepage_adjust() is a quirk of
sorts that is necessitated because thp_adjust() modifies the local @pfn
variable in FNAME(page_fault)(), nonpaging_map() and tdp_page_fault().
Those functions all call kvm_release_pfn_clean() on @pfn, so thp_adjust()
needs to transfer the page reference purely for correctness when the pfn
is released.

disallowed_hugepage_adjust() is called from __direct_map() and its
modification of the pfn is also contained to __direct_map(), i.e. the
updated @pfn doesn't get propagated back up to the fault handlers.  Thus,
kvm_release_pfn_clean() is called on the original pfn and so there's no
need to transfer the page reference.

The above discrepancy can resolved by moving thp_adjust() into FNAME(fetch)
and __direct_map() so that the "top-level" @pfn isn't modified.  Getting
rid of the kvm_get_pfn() call would be a nice side effect.  The downside is
that @force_pt_level would need to be passed down the call stack.