On 08/23/22 12:23, David Hildenbrand wrote:
> On 23.08.22 12:02, Baolin Wang wrote:
> > 
> > 
> > On 8/23/2022 4:29 PM, David Hildenbrand wrote:
> >> On 23.08.22 09:50, Baolin Wang wrote:
> >>> On some architectures (like ARM64), it can support CONT-PTE/PMD size
> >>> hugetlb, which means it can support not only PMD/PUD size hugetlb
> >>> (2M and 1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size
> >>> specified.
> >>>
> >>> So when looking up a CONT-PTE size hugetlb page by follow_page(), it
> >>> will use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE
> >>> size hugetlb in follow_page_pte(). However this pte entry lock is incorrect
> >>> for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to
> >>> get the correct lock, which is mm->page_table_lock.
> >>>
> >>> That means the pte entry of the CONT-PTE size hugetlb under current
> >>> pte lock is unstable in follow_page_pte(), we can continue to migrate
> >>> or poison the pte entry of the CONT-PTE size hugetlb, which can cause
> >>> some potential race issues, and following pte_xxx() validation is also
> >>> unstable in follow_page_pte(), even though they are under the 'pte lock'.
> >>>
> >>> Moreover we should use huge_ptep_get() to get the pte entry value of
> >>> the CONT-PTE size hugetlb, which already takes into account the subpages'
> >>> dirty or young bits in case we missed the dirty or young state of the
> >>> CONT-PTE size hugetlb.
> >>>
> >>> To fix above issues, introducing a new helper follow_huge_pte() to look
> >>> up a CONT-PTE size hugetlb page, which uses huge_pte_lock() to get the
> >>> correct pte entry lock to make the pte entry stable, as well as
> >>> supporting non-present pte handling.
> >>>
> >>> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
> >>> ---
> >>>  include/linux/hugetlb.h |  8 ++++++++
> >>>  mm/gup.c                | 11 ++++++++++
> >>>  mm/hugetlb.c            | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  3 files changed, 72 insertions(+)
> >>>
> >>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> >>> index 3ec981a..d491138 100644
> >>> --- a/include/linux/hugetlb.h
> >>> +++ b/include/linux/hugetlb.h
> >>> @@ -207,6 +207,8 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
> >>>  struct page *follow_huge_pd(struct vm_area_struct *vma,
> >>>  			    unsigned long address, hugepd_t hpd,
> >>>  			    int flags, int pdshift);
> >>> +struct page *follow_huge_pte(struct vm_area_struct *vma, unsigned long address,
> >>> +			     pmd_t *pmd, int flags);
> >>>  struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> >>>  			     pmd_t *pmd, int flags);
> >>>  struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
> >>> @@ -312,6 +314,12 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
> >>>  	return NULL;
> >>>  }
> >>>  
> >>> +static inline struct page *follow_huge_pte(struct vm_area_struct *vma,
> >>> +					   unsigned long address, pmd_t *pmd, int flags)
> >>> +{
> >>> +	return NULL;
> >>> +}
> >>> +
> >>>  static inline struct page *follow_huge_pmd(struct mm_struct *mm,
> >>>  					   unsigned long address, pmd_t *pmd, int flags)
> >>>  {
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 3b656b7..87a94f5 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -534,6 +534,17 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> >>>  	if (unlikely(pmd_bad(*pmd)))
> >>>  		return no_page_table(vma, flags);
> >>>  
> >>> +	/*
> >>> +	 * Considering PTE level hugetlb, like continuous-PTE hugetlb on
> >>> +	 * ARM64 architecture.
> >>> +	 */
> >>> +	if (is_vm_hugetlb_page(vma)) {
> >>> +		page = follow_huge_pte(vma, address, pmd, flags);
> >>> +		if (page)
> >>> +			return page;
> >>> +		return no_page_table(vma, flags);
> >>> +	}
> >>> +
> >>>  	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> >>>  	pte = *ptep;
> >>>  	if (!pte_present(pte)) {
> >>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>> index 6c00ba1..cf742d1 100644
> >>> --- a/mm/hugetlb.c
> >>> +++ b/mm/hugetlb.c
> >>> @@ -6981,6 +6981,59 @@ struct page * __weak
> >>>  	return NULL;
> >>>  }
> >>>  
> >>> +/* Support looking up a CONT-PTE size hugetlb page. */
> >>> +struct page * __weak
> >>> +follow_huge_pte(struct vm_area_struct *vma, unsigned long address,
> >>> +		pmd_t *pmd, int flags)
> >>> +{
> >>> +	struct mm_struct *mm = vma->vm_mm;
> >>> +	struct hstate *hstate = hstate_vma(vma);
> >>> +	unsigned long size = huge_page_size(hstate);
> >>> +	struct page *page = NULL;
> >>> +	spinlock_t *ptl;
> >>> +	pte_t *ptep, pte;
> >>> +
> >>> +	/*
> >>> +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> >>> +	 * follow_hugetlb_page().
> >>> +	 */
> >>> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
> >>> +		return NULL;
> >>> +
> >>> +	ptep = huge_pte_offset(mm, address, size);
> >>> +	if (!ptep)
> >>> +		return NULL;
> >>> +
> >>> +retry:
> >>> +	ptl = huge_pte_lock(hstate, mm, ptep);
> >>> +	pte = huge_ptep_get(ptep);
> >>> +	if (pte_present(pte)) {
> >>> +		page = pte_page(pte);
> >>> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> >>> +			page = NULL;
> >>> +			goto out;
> >>> +		}
> >>> +	} else {
> >>> +		if (!(flags & FOLL_MIGRATION)) {
> >>> +			page = NULL;
> >>> +			goto out;
> >>> +		}
> >>> +
> >>> +		if (is_hugetlb_entry_migration(pte)) {
> >>> +			spin_unlock(ptl);
> >>> +			__migration_entry_wait_huge(ptep, ptl);
> >>> +			goto retry;
> >>> +		}
> >>> +		/*
> >>> +		 * hwpoisoned entry is treated as no_page_table in
> >>> +		 * follow_page_mask().
> >>> +		 */
> >>> +	}
> >>> +out:
> >>> +	spin_unlock(ptl);
> >>> +	return page;
> >>> +}
> >>> +
> >>>  struct page * __weak
> >>>  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
> >>>  		pmd_t *pmd, int flags)
> >>
> >>
> >> Can someone explain why:
> >> * follow_page() goes via follow_page_mask() for hugetlb
> >> * __get_user_pages() goes via follow_hugetlb_page() and never via
> >>   follow_page_mask() for hugetlb?
> >>
> >> IOW, why can't we make follow_page_mask() just not handle hugetlb and
> >> route everything via follow_hugetlb_page() -- we primarily only have to
> >> teach it to not trigger faults.

I have no idea how we got into this situation, and do agree that it makes
little sense for both follow_page_mask and follow_hugetlb_page to do page
table walking differently for hugetlb pages.

I think I have noted elsewhere that all those follow_huge_p*d routines will
look the same.  It seems they were just added as needed when the
follow_page_mask page table walking code was fleshed out.  This also needs
a cleanup.  If we eliminate hugetlb handling from follow_page_mask, perhaps
we can get rid of all these?
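
For reference, the lock mismatch described in the changelog comes from
huge_pte_lockptr(): only PMD-sized hugetlb uses the split page table lock,
while every other hugetlb size (including CONT-PTE) falls back to
mm->page_table_lock, whereas follow_page_pte() takes the per-table split
PTE lock through pte_offset_map_lock(). Roughly, paraphrased from the
current helper (details may differ between releases):

	static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
						   struct mm_struct *mm, pte_t *pte)
	{
		/* PMD-sized hugetlb shares the split PMD page table lock */
		if (huge_page_size(h) == PMD_SIZE)
			return pmd_lockptr(mm, (pmd_t *) pte);
		/* all other sizes (e.g. CONT-PTE) use the mm-wide lock */
		return &mm->page_table_lock;
	}

So a CONT-PTE hugetlb entry inspected under pte_offset_map_lock() is not
serialized against the hugetlb fault, migration or poisoning paths, which
take mm->page_table_lock via huge_pte_lock().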

> > 
> > IMHO, these follow_huge_xxx() functions are arch-specified at first and
> > were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> > hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
> > still some arch-specified follow_huge_xxx() definition, for example:
> > ia64: follow_huge_addr
> > powerpc: follow_huge_pd
> > s390: follow_huge_pud
> > 
> > What I mean is that follow_hugetlb_page() is a common and
> > not-arch-specified function, is it suitable to change it to be
> > arch-specified?
> > And thinking more, can we rename follow_hugetlb_page() as
> > hugetlb_page_faultin() and simplify it to only handle the page faults of
> > hugetlb like the faultin_page() for normal page? That means we can make
> > sure only follow_page_mask() can handle hugetlb.

Something like that might work, but you still have two page table walkers
for hugetlb.  I like David's idea (if I understand it correctly) of using
follow_hugetlb_page for both cases.  As noted, it will need to be taught
how to not trigger faults in the follow_page_mask case.
-- 
Mike Kravetz

> If follow_hugetlb_page() can be arch-independent, why do we need the
> other arch-dependent functions?
> 
> It all looks a bit weird to have two functions that walk page tables and
> are hugetlb aware.
> 
> Either this screams for a cleanup or I am missing something fundamental.
> 
> -- 
> Thanks,
> 
> David / dhildenb
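
To make the suggested direction a bit more concrete (an illustrative sketch
only -- hugetlb_follow_page_mask() is a hypothetical name here, not existing
code), follow_page_mask() would stop special-casing hugetlb at each page
table level and instead hand every hugetlb VMA to a single hugetlb-aware
walker that is taught not to trigger faults:

	/* sketch: hypothetical single dispatch point for hugetlb */
	if (is_vm_hugetlb_page(vma))
		return hugetlb_follow_page_mask(vma, address, flags);

That one walker could then do the huge_pte_offset() + huge_pte_lock() lookup
for every hugetlb page size, and the per-level follow_huge_p*d() helpers
(including the follow_huge_pte() added by this patch) could go away.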