On 08/24/22 17:41, Baolin Wang wrote:
> 
> 
> On 8/24/2022 3:31 PM, David Hildenbrand wrote:
> > > > > > 
> > > > > > IMHO, these follow_huge_xxx() functions are arch-specified at first and
> > > > > > were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> > > > > > hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
> > > > > > still some arch-specified follow_huge_xxx() definition, for example:
> > > > > > ia64: follow_huge_addr
> > > > > > powerpc: follow_huge_pd
> > > > > > s390: follow_huge_pud
> > > > > > 
> > > > > > What I mean is that follow_hugetlb_page() is a common and
> > > > > > not-arch-specified function, is it suitable to change it to be
> > > > > > arch-specified?
> > > > > > And thinking more, can we rename follow_hugetlb_page() as
> > > > > > hugetlb_page_faultin() and simplify it to only handle the page faults of
> > > > > > hugetlb like the faultin_page() for normal page? That means we can make
> > > > > > sure only follow_page_mask() can handle hugetlb.
> > > > > 
> > > > 
> > > > Something like that might work, but you still have two page table walkers
> > > > for hugetlb. I like David's idea (if I understand it correctly) of
> > > What I mean is we may change the hugetlb handling like normal page:
> > > 1) use follow_page_mask() to look up a hugetlb firstly.
> > > 2) if can not get the hugetlb, then try to page fault by
> > > hugetlb_page_faultin().
> > > 3) if page fault succeeded, then retry to find hugetlb by
> > > follow_page_mask().
> > 
> > That implies putting more hugetlbfs special code into generic GUP,
> > turning it even more complicated. But of course, it depends on how the
> > end result looks like. My gut feeling was that hugetlb is better handled
> > in follow_hugetlb_page() separately (just like we do with a lot of other
> > page table walkers).
> 
> OK, fair enough.
> 
> > 
> > > Just a rough thought, and I need more investigation for my idea and
> > > David's idea.
> > > 
> > > > using follow_hugetlb_page for both cases. As noted, it will need to be
> > > > taught how to not trigger faults in the follow_page_mask case.
> > > > 
> > > Anyway, I also agree we need some cleanup, and firstly I think we should
> > > cleanup these arch-specified follow_huge_xxx() on some architectures
> > > which are similar with the common ones. I will look into these.
> > 
> > There was a recent discussion on that, e.g.:
> > 
> > https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
> 
> Thanks.
> 
> > 
> > > 
> > > However, considering cleanup may need more investigation and
> > > refactoring, now I prefer to make these bug-fix patches of this patchset
> > > into mainline firstly, which are suitable to backport to old version to
> > > fix potential race issues. Mike and David, how do you think? Could you
> > > help to review these patches? Thanks.
> > 
> > Patch #1 certainly adds more special code just to handle another hugetlb
> > corner case (CONT pages), and maybe just making it all use
> > follow_hugetlb_page() would be even cleaner and less error prone.
> > 
> > I agree that locking is shaky, but I'm not sure if we really want to
> > backport this to stable trees:
> > 
> > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > 
> > "It must fix a real bug that bothers people (not a, "This could be a
> > problem..." type thing)."
> > 
> > 
> > Do we actually have any instance of this being a real (and not a
> > theoretical) problem? If not, I'd rather clean it all up right away.
> 
> I think this is a real problem (not theoretical), and it is easy to write
> some code to show the issue. For example, suppose thread A is trying to
> look up a CONT-PTE size hugetlb page under the lock, however another
> thread B can migrate the CONT-PTE hugetlb page at the same time, which
> will cause thread A to get an incorrect page; if thread A wants to do
> something with this incorrect page, an error occurs.

Is the primary concern the locking?  If so, I am not sure we have an issue.
As mentioned in your commit message, current code will use
pte_offset_map_lock().  pte_offset_map_lock() uses pte_lockptr, and
pte_lockptr will either be the mm-wide lock or the pmd_page lock.  To me,
it seems that either would provide correct synchronization for CONT-PTE
entries.  Am I missing something or misreading the code?
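
(For reference, this is roughly the helper I am talking about -- paraphrased
from include/linux/mm.h of this era, so treat it as a sketch rather than a
verbatim quote:

        /*
         * With split PTE locks the lock lives in the struct page of the
         * page table page; otherwise it is the single mm-wide lock.
         */
        #if USE_SPLIT_PTE_PTLOCKS
        static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
        {
                return ptlock_ptr(pmd_page(*pmd));
        }
        #else
        static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
        {
                return &mm->page_table_lock;
        }
        #endif

        #define pte_offset_map_lock(mm, pmd, address, ptlp)     \
        ({                                                      \
                spinlock_t *__ptl = pte_lockptr(mm, pmd);       \
                pte_t *__pte = pte_offset_map(pmd, address);    \
                *(ptlp) = __ptl;                                \
                spin_lock(__ptl);                               \
                __pte;                                          \
        })

Since all the CONT-PTE entries of a single huge page live in the same PTE
table under the same pmd, they should all resolve to the same spinlock
either way.)
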
I started looking at the code cleanup suggested by David.  Here is a quick
patch (not tested and likely containing errors) to see if this is a step in
the right direction.  I like it because we get rid of/combine all those
follow_huge_p*d routines.

>From 35d117a707c1567ddf350554298697d40eace0d7 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Wed, 24 Aug 2022 15:59:15 -0700
Subject: [PATCH] hugetlb: call hugetlb_follow_page_mask for hugetlb pages in follow_page_mask

At the beginning of follow_page_mask, there currently is a call to
follow_huge_addr which 'may' handle hugetlb pages.  ia64 is the only
architecture which (incorrectly) provides a follow_huge_addr routine that
does not return error.

Instead, at each level of the page table a check is made for a hugetlb
entry.  If a hugetlb entry is found, a call to a routine associated with
that page table level such as follow_huge_pmd is made.

All the follow_huge_p*d routines are basically the same.  In addition huge
page size can be derived from the vma, so we know where in the page table
a huge page would reside.  So, replace follow_huge_addr with a new
architecture independent routine which will provide the same functionality
as the follow_huge_p*d routines.  We can then eliminate the p*d_huge checks
in follow_page_mask page table walking as well as the follow_huge_p*d
routines themselves.

follow_page_mask still has is_hugepd hugetlb checks during page table
walking.  This is due to these checks and follow_huge_pd being architecture
specific.  These can be eliminated if hugetlb_follow_page_mask can be
overwritten by architectures (powerpc) that need to do follow_huge_pd
processing.

Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
---
 arch/ia64/mm/hugetlbpage.c |  15 ----
 arch/s390/mm/hugetlbpage.c |  10 ---
 include/linux/hugetlb.h    |  41 +++-------
 mm/gup.c                   |  27 +------
 mm/hugetlb.c               | 159 ++++++++++++-------------------
 5 files changed, 62 insertions(+), 190 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index f993cb36c062..380d2f3966c9 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -91,21 +91,6 @@ int prepare_hugepage_range(struct file *file,
         return 0;
 }
 
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
-{
-        struct page *page;
-        pte_t *ptep;
-
-        if (REGION_NUMBER(addr) != RGN_HPAGE)
-                return ERR_PTR(-EINVAL);
-
-        ptep = huge_pte_offset(mm, addr, HPAGE_SIZE);
-        if (!ptep || pte_none(*ptep))
-                return NULL;
-        page = pte_page(*ptep);
-        page += ((addr & ~HPAGE_MASK) >> PAGE_SHIFT);
-        return page;
-}
 int pmd_huge(pmd_t pmd)
 {
         return 0;
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index 10e51ef9c79a..c299a18273ff 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -237,16 +237,6 @@ int pud_huge(pud_t pud)
         return pud_large(pud);
 }
 
-struct page *
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-                pud_t *pud, int flags)
-{
-        if (flags & FOLL_GET)
-                return NULL;
-
-        return pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
-}
-
 bool __init arch_hugetlb_valid_size(unsigned long size)
 {
         if (MACHINE_HAS_EDAT1 && size == PMD_SIZE)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..0c19d200c851 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -142,6 +142,8 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
                              unsigned long len);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
                             struct vm_area_struct *, struct vm_area_struct *);
+struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
+                                unsigned long address, unsigned int flags);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
                          struct page **, struct vm_area_struct **,
                          unsigned long *, unsigned long *, long, unsigned int,
@@ -202,17 +204,9 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
                                 unsigned long *start, unsigned long *end);
-struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
-                              int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
                             unsigned long address, hugepd_t hpd,
                             int flags, int pdshift);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-                                pmd_t *pmd, int flags);
-struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
-                                pud_t *pud, int flags);
-struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
-                                pgd_t *pgd, int flags);
 
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
@@ -257,6 +251,13 @@ static inline void adjust_range_if_pmd_sharing_possible(
 {
 }
 
+static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
+                                unsigned long address, unsigned int flags)
+{
+        /* should never happen, but do not want to BUG */
+        return ERR_PTR(-EINVAL);
+}
+
 static inline long follow_hugetlb_page(struct mm_struct *mm,
                         struct vm_area_struct *vma, struct page **pages,
                         struct vm_area_struct **vmas, unsigned long *position,
@@ -267,12 +268,6 @@ static inline long follow_hugetlb_page(struct mm_struct *mm,
         return 0;
 }
 
-static inline struct page *follow_huge_addr(struct mm_struct *mm,
-                                        unsigned long address, int write)
-{
-        return ERR_PTR(-EINVAL);
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
                                           struct mm_struct *src,
                                           struct vm_area_struct *dst_vma,
@@ -312,24 +307,6 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
         return NULL;
 }
 
-static inline struct page *follow_huge_pmd(struct mm_struct *mm,
-                                unsigned long address, pmd_t *pmd, int flags)
-{
-        return NULL;
-}
-
-static inline struct page *follow_huge_pud(struct mm_struct *mm,
-                                unsigned long address, pud_t *pud, int flags)
-{
-        return NULL;
-}
-
-static inline struct page *follow_huge_pgd(struct mm_struct *mm,
-                                unsigned long address, pgd_t *pgd, int flags)
-{
-        return NULL;
-}
-
 static inline int prepare_hugepage_range(struct file *file,
                         unsigned long addr, unsigned long len)
 {
diff --git a/mm/gup.c b/mm/gup.c
index 3b656b7e8a3c..a93c04437faa 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -661,12 +661,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
         pmdval = READ_ONCE(*pmd);
         if (pmd_none(pmdval))
                 return no_page_table(vma, flags);
-        if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) {
-                page = follow_huge_pmd(mm, address, pmd, flags);
-                if (page)
-                        return page;
-                return no_page_table(vma, flags);
-        }
         if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
                 page = follow_huge_pd(vma, address,
                                       __hugepd(pmd_val(pmdval)), flags,
@@ -764,12 +758,6 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
         pud = pud_offset(p4dp, address);
         if (pud_none(*pud))
                 return no_page_table(vma, flags);
-        if (pud_huge(*pud) && is_vm_hugetlb_page(vma)) {
-                page = follow_huge_pud(mm, address, pud, flags);
-                if (page)
-                        return page;
-                return no_page_table(vma, flags);
-        }
         if (is_hugepd(__hugepd(pud_val(*pud)))) {
                 page = follow_huge_pd(vma, address,
                                       __hugepd(pud_val(*pud)), flags,
@@ -851,24 +839,15 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 
         ctx->page_mask = 0;
 
-        /* make this handle hugepd */
-        page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
-        if (!IS_ERR(page)) {
-                WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
-                return page;
-        }
+        /* hugetlb is special */
+        if (is_vm_hugetlb_page(vma))
+                return hugetlb_follow_page_mask(vma, address, flags);
 
         pgd = pgd_offset(mm, address);
 
         if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
                 return no_page_table(vma, flags);
 
-        if (pgd_huge(*pgd)) {
-                page = follow_huge_pgd(mm, address, pgd, flags);
-                if (page)
-                        return page;
-                return no_page_table(vma, flags);
-        }
         if (is_hugepd(__hugepd(pgd_val(*pgd)))) {
                 page = follow_huge_pd(vma, address,
                                       __hugepd(pgd_val(*pgd)), flags,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c00ba1dde32..947401df8190 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6168,6 +6168,56 @@ static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte,
         return false;
 }
 
+struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
+                                unsigned long address, unsigned int flags)
+{
+        struct hstate *h = hstate_vma(vma);
+        struct mm_struct *mm = vma->vm_mm;
+        unsigned long haddr = address & huge_page_mask(h);
+        struct page *page = NULL;
+        spinlock_t *ptl;
+        pte_t *pte, entry;
+
+        /*
+         * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
+         * follow_hugetlb_page().
+         */
+        if (WARN_ON_ONCE(flags & FOLL_PIN))
+                return NULL;
+
+        pte = huge_pte_offset(mm, haddr, huge_page_size(h));
+        if (!pte)
+                return NULL;
+
+retry:
+        ptl = huge_pte_lock(h, mm, pte);
+        entry = huge_ptep_get(pte);
+        if (pte_present(entry)) {
+                page = pte_page(entry);
+                /*
+                 * try_grab_page() should always succeed here, because we hold
+                 * the ptl lock and have verified pte_present().
+                 */
+                if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+                        page = NULL;
+                        goto out;
+                }
+        } else {
+                if (is_hugetlb_entry_migration(entry)) {
+                        spin_unlock(ptl);
+                        __migration_entry_wait_huge(pte, ptl);
+                        goto retry;
+                }
+                /*
+                 * hwpoisoned entry is treated as no_page_table in
+                 * follow_page_mask().
+                 */
+        }
+out:
+        spin_unlock(ptl);
+        return page;
+}
+
 long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
                          struct page **pages, struct vm_area_struct **vmas,
                          unsigned long *position, unsigned long *nr_pages,
@@ -6966,13 +7016,6 @@ __weak unsigned long hugetlb_mask_last_page(struct hstate *h)
  * These functions are overwritable if your architecture needs its own
  * behavior.
  */
-struct page * __weak
-follow_huge_addr(struct mm_struct *mm, unsigned long address,
-                              int write)
-{
-        return ERR_PTR(-EINVAL);
-}
-
 struct page * __weak
 follow_huge_pd(struct vm_area_struct *vma,
                unsigned long address, hugepd_t hpd, int flags, int pdshift)
@@ -6981,108 +7024,6 @@ follow_huge_pd(struct vm_area_struct *vma,
         return NULL;
 }
 
-struct page * __weak
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
-                pmd_t *pmd, int flags)
-{
-        struct page *page = NULL;
-        spinlock_t *ptl;
-        pte_t pte;
-
-        /*
-         * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
-         * follow_hugetlb_page().
-         */
-        if (WARN_ON_ONCE(flags & FOLL_PIN))
-                return NULL;
-
-retry:
-        ptl = pmd_lockptr(mm, pmd);
-        spin_lock(ptl);
-        /*
-         * make sure that the address range covered by this pmd is not
-         * unmapped from other threads.
-         */
-        if (!pmd_huge(*pmd))
-                goto out;
-        pte = huge_ptep_get((pte_t *)pmd);
-        if (pte_present(pte)) {
-                page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
-                /*
-                 * try_grab_page() should always succeed here, because: a) we
-                 * hold the pmd (ptl) lock, and b) we've just checked that the
-                 * huge pmd (head) page is present in the page tables. The ptl
-                 * prevents the head page and tail pages from being rearranged
-                 * in any way. So this page must be available at this point,
-                 * unless the page refcount overflowed:
-                 */
-                if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
-                        page = NULL;
-                        goto out;
-                }
-        } else {
-                if (is_hugetlb_entry_migration(pte)) {
-                        spin_unlock(ptl);
-                        __migration_entry_wait_huge((pte_t *)pmd, ptl);
-                        goto retry;
-                }
-                /*
-                 * hwpoisoned entry is treated as no_page_table in
-                 * follow_page_mask().
-                 */
-        }
-out:
-        spin_unlock(ptl);
-        return page;
-}
-
-struct page * __weak
-follow_huge_pud(struct mm_struct *mm, unsigned long address,
-                pud_t *pud, int flags)
-{
-        struct page *page = NULL;
-        spinlock_t *ptl;
-        pte_t pte;
-
-        if (WARN_ON_ONCE(flags & FOLL_PIN))
-                return NULL;
-
-retry:
-        ptl = huge_pte_lock(hstate_sizelog(PUD_SHIFT), mm, (pte_t *)pud);
-        if (!pud_huge(*pud))
-                goto out;
-        pte = huge_ptep_get((pte_t *)pud);
-        if (pte_present(pte)) {
-                page = pud_page(*pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
-                if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
-                        page = NULL;
-                        goto out;
-                }
-        } else {
-                if (is_hugetlb_entry_migration(pte)) {
-                        spin_unlock(ptl);
-                        __migration_entry_wait(mm, (pte_t *)pud, ptl);
-                        goto retry;
-                }
-                /*
-                 * hwpoisoned entry is treated as no_page_table in
-                 * follow_page_mask().
-                 */
-        }
-out:
-        spin_unlock(ptl);
-        return page;
-}
-
-struct page * __weak
-follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
-{
-        if (flags & (FOLL_GET | FOLL_PIN))
-                return NULL;
-
-        return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
-}
-
 int isolate_hugetlb(struct page *page, struct list_head *list)
 {
         int ret = 0;
-- 
2.37.1
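
(Aside on the last paragraph of the commit message: the "overwritten by
architectures" part would rely on the usual __weak linkage trick, the same
pattern follow_huge_pd() already uses above.  The patch does not add the
__weak annotation yet, so treat the following as a toy user-space
illustration of the mechanism rather than proposed kernel code:

        /* generic.c -- plays the role of mm/hugetlb.c */
        #include <stdio.h>

        /* weak default, analogous to a generic hugetlb_follow_page_mask() */
        int __attribute__((weak)) arch_hook(int x)
        {
                return x;
        }

        int main(void)
        {
                printf("arch_hook(21) = %d\n", arch_hook(21));
                return 0;
        }

        /*
         * arch.c -- plays the role of an architecture file (e.g. powerpc).
         * A non-weak definition silently wins at link time, so the generic
         * file needs no #ifdef:
         *
         *      int arch_hook(int x)
         *      {
         *              return x * 2;
         *      }
         *
         * gcc generic.c          -> prints 21 (weak default used)
         * gcc generic.c arch.c   -> prints 42 (arch override used)
         */

)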