On Fri, Aug 18, 2017 at 02:29:18PM -0700, Mike Kravetz wrote: > On 08/18/2017 07:54 AM, Punit Agrawal wrote: > > When walking the page tables to resolve an address that points to > > !p*d_present() entry, huge_pte_offset() returns inconsistent values > > depending on the level of page table (PUD or PMD). > > > > It returns NULL in the case of a PUD entry while in the case of a PMD > > entry, it returns a pointer to the page table entry. > > > > A similar inconsitency exists when handling swap entries - returns NULL > > for a PUD entry while a pointer to the pte_t is retured for the PMD entry. > > > > Update huge_pte_offset() to make the behaviour consistent - return a > > pointer to the pte_t for hugepage or swap entries. Only return NULL in > > instances where we have a p*d_none() entry and the size parameter > > doesn't match the hugepage size at this level of the page table. > > > > Document the behaviour to clarify the expected behaviour of this function. > > This is to set clear semantics for architecture specific implementations > > of huge_pte_offset(). > > > > Signed-off-by: Punit Agrawal <punit.agrawal@xxxxxxx> > > Cc: Catalin Marinas <catalin.marinas@xxxxxxx> > > Cc: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx> > > Cc: Steve Capper <steve.capper@xxxxxxx> > > Cc: Will Deacon <will.deacon@xxxxxxx> > > Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> > > Cc: Michal Hocko <mhocko@xxxxxxxx> > > Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> > > --- > > > > Hi Andrew, > > > > From discussions on the arm64 implementation of huge_pte_offset()[0] > > we realised that there is benefit from returning a pte_t* in the case > > of p*d_none(). > > > > The fault handling code in hugetlb_fault() can handle p*d_none() > > entries and saves an extra round trip to huge_pte_alloc(). Other > > callers of huge_pte_offset() should be ok as well. > > Yes, this change would eliminate that call to huge_pte_alloc() in > hugetlb_fault(). However, huge_pte_offset() is now returning a pointer > to a p*d_none() pte in some instances where it would have previously > returned NULL. Correct? Yes (whether it was previously the right thing to return is a different matter; that's what we are trying to clarify in the generic code so that we can have similar semantics on arm64). > I went through the callers, and like you am fairly confident that they > can handle this situation. But, returning p*d_none() instead of NULL > does change the execution path in several routines such as > copy_hugetlb_page_range, __unmap_hugepage_range hugetlb_change_protection, > and follow_hugetlb_page. If huge_pte_alloc() returns NULL to these > routines, they do a quick continue, exit, etc. If they are returned > a pointer, they typically lock the page table(s) and then check for > p*d_none() before continuing, exiting, etc. So, it appears that these > routines could potentially slow down a bit with this change (in the specific > case of p*d_none). Arguably (well, my interpretation), it should return a NULL only if the entry is a table entry, potentially pointing to a next level (pmd). In the pud case, this means that sz < PUD_SIZE. If the pud is a last level huge page entry (either present or !present), huge_pte_offset() should return the pointer to it and never NULL. If the entry is a swap or migration one (pte_present() == false) with the current code we don't even enter the corresponding checks in copy_hugetlb_page_range(). I also assume that the ptl __unmap_hugepage_range() is taken to avoid some race when the entry is a huge page (present or not). If such race doesn't exist, we could as well check the huge_pte_none() outside the locked region (which is what the current huge_pte_offset() does with !pud_present()). IMHO, while the current generic huge_pte_offset() avoids some code paths in the functions you mentioned, the results are not always correct (missing swap/migration entries or potentially racy). -- Catalin