On Wed, Dec 07, 2022 at 04:12:31PM -0800, John Hubbard wrote: > On 12/7/22 12:31, Peter Xu wrote: > > huge_pte_offset() is the main walker function for hugetlb pgtables. The > > name is not really representing what it does, though. > > > > Instead of renaming it, introduce a wrapper function called hugetlb_walk() > > which will use huge_pte_offset() inside. Assert on the locks when walking > > the pgtable. > > > > Note, the vma lock assertion will be a no-op for private mappings. > > > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx> > > --- > > fs/hugetlbfs/inode.c | 4 +--- > > fs/userfaultfd.c | 6 ++---- > > include/linux/hugetlb.h | 39 +++++++++++++++++++++++++++++++++++++++ > > mm/hugetlb.c | 32 +++++++++++++------------------- > > mm/page_vma_mapped.c | 2 +- > > mm/pagewalk.c | 4 +--- > > 6 files changed, 57 insertions(+), 30 deletions(-) > > > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > > index fdb16246f46e..48f1a8ad2243 100644 > > --- a/fs/hugetlbfs/inode.c > > +++ b/fs/hugetlbfs/inode.c > > @@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma, > > { > > pte_t *ptep, pte; > > - ptep = huge_pte_offset(vma->vm_mm, addr, > > - huge_page_size(hstate_vma(vma))); > > - > > + ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma))); > > if (!ptep) > > return false; > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > index a602f008dde5..f31fe1a9f4c5 100644 > > --- a/fs/userfaultfd.c > > +++ b/fs/userfaultfd.c > > @@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx, > > unsigned long flags, > > unsigned long reason) > > { > > - struct mm_struct *mm = ctx->mm; > > pte_t *ptep, pte; > > bool ret = true; > > - mmap_assert_locked(mm); > > - > > - ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma)); > > + mmap_assert_locked(ctx->mm); > > + ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma)); > > if (!ptep) > > goto out; > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index 81efd9b9baa2..1c20cbbf3d22 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -2,6 +2,7 @@ > > #ifndef _LINUX_HUGETLB_H > > #define _LINUX_HUGETLB_H > > +#include <linux/mm.h> > > #include <linux/mm_types.h> > > #include <linux/mmdebug.h> > > #include <linux/fs.h> > > @@ -196,6 +197,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, > > * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE. > > * Returns the pte_t* if found, or NULL if the address is not mapped. > > * > > + * IMPORTANT: we should normally not directly call this function, instead > > + * this is only a common interface to implement arch-specific walker. > > + * Please consider using the hugetlb_walk() helper to make sure of the > > + * correct locking is satisfied. > > Or: > > "Please use hugetlb_walk() instead, because that will attempt to verify > the locking for you." Ok. > > > + * > > * Since this function will walk all the pgtable pages (including not only > > * high-level pgtable page, but also PUD entry that can be unshared > > * concurrently for VM_SHARED), the caller of this function should be > > @@ -1229,4 +1235,37 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr); > > #define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end) > > #endif > > +static inline bool > > +__vma_shareable_flags_pmd(struct vm_area_struct *vma) > > +{ > > + return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) && > > + vma->vm_private_data; > > +} > > + > > +/* > > + * Safe version of huge_pte_offset() to check the locks. See comments > > + * above huge_pte_offset(). > > + */ > > It is odd to say that functionA() is a safe version of functionB(), if the > names are completely different. > > At this point, it is very clear that huge_pte_offset() should be renamed. > I'd suggest something like one of these: > > __hugetlb_walk() > hugetlb_walk_raw() We can. Not only because that's an arch api for years (didn't want to touch more arch code unless necessary), but also since we have hugetlb_walk() that'll be the future interface not huge_pte_offset(). Actually it's good when that's the only thing people can find from its name when they want to have a huge pgtable walk. :) So totally makes sense to do so, but I don't strongly feel like doing it in this patchset if you're okay with it. > > > +static inline pte_t * > > +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz) > > +{ > > +#if defined(CONFIG_HUGETLB_PAGE) && \ > > + defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP) > > + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data; > > + > > + /* > > + * If pmd sharing possible, locking needed to safely walk the > > + * hugetlb pgtables. More information can be found at the comment > > + * above huge_pte_offset() in the same file. > > + * > > + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP. > > + */ > > + if (__vma_shareable_flags_pmd(vma)) > > + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) && > > + !lockdep_is_held( > > + &vma->vm_file->f_mapping->i_mmap_rwsem)); > > +#endif > > + return huge_pte_offset(vma->vm_mm, addr, sz); > > +} > > Let's please not slice up C functions with ifdefs. Instead, stick to the > standard approach of > > #ifdef X > functionC() > { > ...implementation > } > #else > functionC() > { > ...simpler or shorter or stub implementation > } Personally I like the slicing because it clearly tells what's the difference with/without the macros defined. Thanks, -- Peter Xu