On 23.08.22 12:02, Baolin Wang wrote:
> 
> 
> On 8/23/2022 4:29 PM, David Hildenbrand wrote:
>> On 23.08.22 09:50, Baolin Wang wrote:
>>> On some architectures (like ARM64), it can support CONT-PTE/PMD size
>>> hugetlb, which means it can support not only PMD/PUD size hugetlb
>>> (2M and 1G), but also CONT-PTE/PMD size (64K and 32M) if a 4K page
>>> size is specified.
>>>
>>> So when looking up a CONT-PTE size hugetlb page by follow_page(), it
>>> will use pte_offset_map_lock() to get the pte entry lock for the
>>> CONT-PTE size hugetlb in follow_page_pte(). However, this pte entry
>>> lock is incorrect for the CONT-PTE size hugetlb, since we should use
>>> huge_pte_lock() to get the correct lock, which is mm->page_table_lock.
>>>
>>> That means the pte entry of the CONT-PTE size hugetlb under the
>>> current pte lock is unstable in follow_page_pte(): we can continue to
>>> migrate or poison the pte entry of the CONT-PTE size hugetlb, which
>>> can cause some potential race issues, and the following pte_xxx()
>>> validation is also unstable in follow_page_pte(), even though it is
>>> under the 'pte lock'.
>>>
>>> Moreover, we should use huge_ptep_get() to get the pte entry value of
>>> the CONT-PTE size hugetlb, which already takes into account the
>>> subpages' dirty or young bits in case we missed the dirty or young
>>> state of the CONT-PTE size hugetlb.
>>>
>>> To fix the above issues, introduce a new helper follow_huge_pte() to
>>> look up a CONT-PTE size hugetlb page, which uses huge_pte_lock() to
>>> get the correct pte entry lock to make the pte entry stable, as well
>>> as supporting non-present pte handling.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
>>> ---
>>>   include/linux/hugetlb.h |  8 ++++++++
>>>   mm/gup.c                | 11 ++++++++++
>>>   mm/hugetlb.c            | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 72 insertions(+)
>>>
>>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>>> index 3ec981a..d491138 100644
>>> --- a/include/linux/hugetlb.h
>>> +++ b/include/linux/hugetlb.h
>>> @@ -207,6 +207,8 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
>>>  struct page *follow_huge_pd(struct vm_area_struct *vma,
>>>  			    unsigned long address, hugepd_t hpd,
>>>  			    int flags, int pdshift);
>>> +struct page *follow_huge_pte(struct vm_area_struct *vma, unsigned long address,
>>> +			     pmd_t *pmd, int flags);
>>>  struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>>  			     pmd_t *pmd, int flags);
>>>  struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
>>> @@ -312,6 +314,12 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
>>>  	return NULL;
>>>  }
>>>  
>>> +static inline struct page *follow_huge_pte(struct vm_area_struct *vma,
>>> +			unsigned long address, pmd_t *pmd, int flags)
>>> +{
>>> +	return NULL;
>>> +}
>>> +
>>>  static inline struct page *follow_huge_pmd(struct mm_struct *mm,
>>>  					   unsigned long address, pmd_t *pmd, int flags)
>>>  {
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 3b656b7..87a94f5 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -534,6 +534,17 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>  	if (unlikely(pmd_bad(*pmd)))
>>>  		return no_page_table(vma, flags);
>>>  
>>> +	/*
>>> +	 * Considering PTE level hugetlb, like continuous-PTE hugetlb on
>>> +	 * ARM64 architecture.
>>> +	 */
>>> +	if (is_vm_hugetlb_page(vma)) {
>>> +		page = follow_huge_pte(vma, address, pmd, flags);
>>> +		if (page)
>>> +			return page;
>>> +		return no_page_table(vma, flags);
>>> +	}
>>> +
>>>  	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>>>  	pte = *ptep;
>>>  	if (!pte_present(pte)) {
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 6c00ba1..cf742d1 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -6981,6 +6981,59 @@ struct page * __weak
>>>  	return NULL;
>>>  }
>>>  
>>> +/* Support looking up a CONT-PTE size hugetlb page. */
>>> +struct page * __weak
>>> +follow_huge_pte(struct vm_area_struct *vma, unsigned long address,
>>> +		pmd_t *pmd, int flags)
>>> +{
>>> +	struct mm_struct *mm = vma->vm_mm;
>>> +	struct hstate *hstate = hstate_vma(vma);
>>> +	unsigned long size = huge_page_size(hstate);
>>> +	struct page *page = NULL;
>>> +	spinlock_t *ptl;
>>> +	pte_t *ptep, pte;
>>> +
>>> +	/*
>>> +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
>>> +	 * follow_hugetlb_page().
>>> +	 */
>>> +	if (WARN_ON_ONCE(flags & FOLL_PIN))
>>> +		return NULL;
>>> +
>>> +	ptep = huge_pte_offset(mm, address, size);
>>> +	if (!ptep)
>>> +		return NULL;
>>> +
>>> +retry:
>>> +	ptl = huge_pte_lock(hstate, mm, ptep);
>>> +	pte = huge_ptep_get(ptep);
>>> +	if (pte_present(pte)) {
>>> +		page = pte_page(pte);
>>> +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
>>> +			page = NULL;
>>> +			goto out;
>>> +		}
>>> +	} else {
>>> +		if (!(flags & FOLL_MIGRATION)) {
>>> +			page = NULL;
>>> +			goto out;
>>> +		}
>>> +
>>> +		if (is_hugetlb_entry_migration(pte)) {
>>> +			spin_unlock(ptl);
>>> +			__migration_entry_wait_huge(ptep, ptl);
>>> +			goto retry;
>>> +		}
>>> +		/*
>>> +		 * hwpoisoned entry is treated as no_page_table in
>>> +		 * follow_page_mask().
>>> +		 */
>>> +	}
>>> +out:
>>> +	spin_unlock(ptl);
>>> +	return page;
>>> +}
>>> +
>>>  struct page * __weak
>>>  follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>>  		pmd_t *pmd, int flags)
>>
>>
>> Can someone explain why:
>> * follow_page() goes via follow_page_mask() for hugetlb
>> * __get_user_pages() goes via follow_hugetlb_page() and never via
>>   follow_page_mask() for hugetlb?
>>
>> IOW, why can't we make follow_page_mask() just not handle hugetlb and
>> route everything via follow_hugetlb_page() -- we primarily only have to
>> teach it to not trigger faults.
> 
> IMHO, these follow_huge_xxx() functions were arch-specific at first and
> were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> hugetlb: Copy general hugetlb code from x86 to mm"), and there are
> still some arch-specific follow_huge_xxx() definitions, for example:
> ia64: follow_huge_addr
> powerpc: follow_huge_pd
> s390: follow_huge_pud
> 
> What I mean is that follow_hugetlb_page() is a common,
> non-arch-specific function; is it suitable to change it to be
> arch-specific?
> And thinking more, can we rename follow_hugetlb_page() to
> hugetlb_page_faultin() and simplify it to only handle the page faults
> of hugetlb, like faultin_page() for normal pages? That means we can
> make sure only follow_page_mask() can handle hugetlb.

If follow_hugetlb_page() can be arch-independent, why do we need the
other arch-dependent functions? It all looks a bit weird to have two
functions that walk page tables and are hugetlb aware. Either this
screams for a cleanup or I am missing something fundamental.

-- 
Thanks,

David / dhildenb