On Tue, Apr 5, 2022 at 1:48 PM Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
>
> On Thu, Mar 24, 2022 at 7:20 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 23, 2022 at 4:29 PM Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> > >
> > > Make __collapse_huge_page_copy return whether
> > > collapsing/copying anonymous pages succeeded,
> > > and make collapse_huge_page handle the return status.
> > >
> > > Break existing PTE scan loop into two for-loops.
> > > The first loop copies source pages into target huge page,
> > > and can fail gracefully when running into memory errors in
> > > source pages. Roll back the page table and page states
> > > in the 2nd loop when copy failed:
> > > 1) re-establish the PTEs-to-PMD connection.
> > > 2) release pages back to their LRU list.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
> > > ---
> > >  include/linux/highmem.h |  19 ++++++
> > >  mm/khugepaged.c         | 136 ++++++++++++++++++++++++++++++----------
> > >  2 files changed, 122 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > > index 39bb9b47fa9c..15d0aa4d349c 100644
> > > --- a/include/linux/highmem.h
> > > +++ b/include/linux/highmem.h
> > > @@ -281,6 +281,25 @@ static inline void copy_user_highpage(struct page *to, struct page *from,
> > >         kunmap_local(vfrom);
> > >  }
> > >
> > > +/*
> > > + * Machine check exception handled version of copy_user_highpage.
> > > + * Return true if copying page content failed; otherwise false.
> > > + */
> > > +static inline bool copy_user_highpage_mc(struct page *to, struct page *from,
> > > +                                        unsigned long vaddr, struct vm_area_struct *vma)
> >
> > Patch #2 defined copy_highpage_mc() which has the same implementation.
> > This function has two unused parameters: vaddr and vma. It seems we
> > just need to keep one.
> >
>
> Version 2 will delete copy_user_highpage_mc() from the patch series,
> and just keep copy_highpage_mc().
>
> > > +{
> > > +       char *vfrom, *vto;
> > > +       unsigned long ret;
> > > +
> > > +       vfrom = kmap_local_page(from);
> > > +       vto = kmap_local_page(to);
> > > +       ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > +       kunmap_local(vto);
> > > +       kunmap_local(vfrom);
> > > +
> > > +       return ret > 0;
> > > +}
> > > +
> > >  #endif
> > >
> > >  #ifndef __HAVE_ARCH_COPY_HIGHPAGE
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 131492fd1148..84ed177f56ff 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -739,44 +739,97 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >         return 0;
> > >  }
> > >
> > > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > -                                     struct vm_area_struct *vma,
> > > -                                     unsigned long address,
> > > -                                     spinlock_t *ptl,
> > > -                                     struct list_head *compound_pagelist)
> > > +/*
> > > + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> > > + * pages to a hugepage. Cleanup the normal pages if copy succeeds;
> > > + * otherwise restore the original pmd page table.
> > > + *
> > > + * @pte: starting of the PTEs to copy from
> > > + * @page: the new hugepage to copy contents to
> > > + * @pmd: pointer to the new hugepage's PMD
> > > + * @rollback: the original normal PTEs' PMD
> > > + * @address: starting address to copy
> > > + * @pte_ptl: lock on normal pages' PTEs
> > > + * @compound_pagelist: list that stores compound pages
> > > + */
> > > +static bool __collapse_huge_page_copy(pte_t *pte,
> > > +                                     struct page *page,
> > > +                                     pmd_t *pmd,
> > > +                                     pmd_t rollback,
> > > +                                     struct vm_area_struct *vma,
> > > +                                     unsigned long address,
> > > +                                     spinlock_t *pte_ptl,
> > > +                                     struct list_head *compound_pagelist)
> > >  {
> > >         struct page *src_page, *tmp;
> > >         pte_t *_pte;
> > > -       for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > -            _pte++, page++, address += PAGE_SIZE) {
> > > -               pte_t pteval = *_pte;
> > > +       pte_t pteval;
> > > +       unsigned long _address;
> > > +       spinlock_t *pmd_ptl;
> > > +       bool copy_succeeded = true;
> > >
> > > -               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +       /*
> > > +        * Copying pages' contents is subject to memory poison at any iteration.
> > > +        */
> > > +       for (_pte = pte, _address = address;
> > > +            _pte < pte + HPAGE_PMD_NR;
> > > +            _pte++, page++, _address += PAGE_SIZE) {
> > > +               pteval = *_pte;
> > > +
> > > +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > >                         clear_user_highpage(page, address);
> > > -                       add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > -                       if (is_zero_pfn(pte_pfn(pteval))) {
> > > -                               /*
> > > -                                * ptl mostly unnecessary.
> > > -                                */
> > > -                               spin_lock(ptl);
> > > -                               ptep_clear(vma->vm_mm, address, _pte);
> > > -                               spin_unlock(ptl);
> > > +               else {
> > > +                       src_page = pte_page(pteval);
> > > +                       if (copy_user_highpage_mc(page, src_page, address, vma)) {
> > > +                               copy_succeeded = false;
> > > +                               break;
> > > +                       }
> > > +               }
> > > +       }
> > > +
> > > +       if (!copy_succeeded) {
> > > +               /*
> > > +                * Copying failed, re-establish the regular PMD that
> > > +                * points to regular page table. Since PTEs are still
> > > +                * isolated and locked, acquiring anon_vma_lock is unnecessary.
> > > +                */
> > > +               pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > > +               pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> > > +               spin_unlock(pmd_ptl);
> > I think we could just jump to list_for_each_entry_safe, right? But it
> > seems that list_for_each_entry_safe is not enough, please see the
> > later comments.
>
> Yes, it is not enough, we need to release_pte_page() for non-compound pages.
>
> If we are going to use jump/goto, we need two for-loops over PTEs, one
> for the copy
> succeeded case, and one for the failure case. What I have right now is combining
> the two for-loops into one.
> I don't like to switch to jump because: 1) goto statements have
> bad readability, 2) one for-loop is more concise than two for-loops.

Makes sense to me, it is not worth two loops.

>
> >
> > > +       }
> > > +
> > > +       for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > > +            _pte++, _address += PAGE_SIZE) {
> > > +               pteval = *_pte;
> > > +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > +                       if (copy_succeeded) {
> > With the above jump, you don't need to check if copy is succeeded or
> > not since this is the succeeded path only.
> >
> > > +                               add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > +                               if (is_zero_pfn(pte_pfn(pteval))) {
> > > +                                       /*
> > > +                                        * ptl mostly unnecessary.
> > > +                                        */
> > > +                                       spin_lock(pte_ptl);
> > > +                                       pte_clear(vma->vm_mm, _address, _pte);
> > > +                                       spin_unlock(pte_ptl);
> > > +                               }
> > >                         }
> > >                 } else {
> > >                         src_page = pte_page(pteval);
> > > -                       copy_user_highpage(page, src_page, address, vma);
> > >                         if (!PageCompound(src_page))
> > >                                 release_pte_page(src_page);
> > > -                       /*
> > > -                        * ptl mostly unnecessary, but preempt has to
> > > -                        * be disabled to update the per-cpu stats
> > > -                        * inside page_remove_rmap().
> > > -                        */
> > > -                       spin_lock(ptl);
> > > -                       ptep_clear(vma->vm_mm, address, _pte);
> > > -                       page_remove_rmap(src_page, false);
> > > -                       spin_unlock(ptl);
> > > -                       free_page_and_swap_cache(src_page);
> > > +
> > > +                       if (copy_succeeded) {
> > > +                               /*
> > > +                                * ptl mostly unnecessary, but preempt has to
> > > +                                * be disabled to update the per-cpu stats
> > > +                                * inside page_remove_rmap().
> > > +                                */
> > > +                               spin_lock(pte_ptl);
> > > +                               pte_clear(vma->vm_mm, _address, _pte);
> > > +                               page_remove_rmap(src_page, false);
> > > +                               spin_unlock(pte_ptl);
> > > +                               free_page_and_swap_cache(src_page);
> > > +                       }
> > >                 }
> > >         }
> > >
> > > @@ -784,6 +837,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > >                 list_del(&src_page->lru);
> > >                 release_pte_page(src_page);
> > >         }
> > If copy failed you need to unpin the isolated pages and put them back
> > to LRU IIUC.
>
> No matter if the copy failed or succeeded, we need to use a for-loop
> to unpin isolated pages
> and put them back to LRU (i.e. release_pte_page()). In current code,
> this is a common
> operation in the above for-loop.
>
> >
> > > +
> > > +       return copy_succeeded;
> > >  }
> > >
> > >  static void khugepaged_alloc_sleep(void)
> > > @@ -1066,6 +1121,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >         struct vm_area_struct *vma;
> > >         struct mmu_notifier_range range;
> > >         gfp_t gfp;
> > > +       bool copy_succeeded = false;
> > >
> > >         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > >
> > > @@ -1177,9 +1233,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > >          */
> > >         anon_vma_unlock_write(vma->anon_vma);
> > >
> > > -       __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > > -                                 &compound_pagelist);
> > > +       copy_succeeded = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> > > +                       vma, address, pte_ptl, &compound_pagelist);
> > >         pte_unmap(pte);
> > > +       if (!copy_succeeded) {
> > Shall the fail handling be moved before pte_unmap()?
>
> I don't think so. I think at this point we need to pte_unmap() no
> matter if the copying
> succeeded or not. It is obviously needed if __collapse_huge_page_copy succeeded
> (because existing code already assumes __collapse_huge_page_copy
> always succeeded).
> For the failure case, my reasoning is based on __collapse_huge_page_isolate.
> If __collapse_huge_page_isolate failed, we need to pte_unmap()
> before goto out_up_write. So I would assume if __collapse_huge_page_copy failed,
> pte_unmap() is also required.

Yeah, make sense.

>
> >
> > > +               result = SCAN_FAIL;
> > I think a new result is preferred, for example, SCAN_COPY_MC, it would
> > be helpful for debug.
>
> Version 2 will add this new enum.
>
> >
> > > +               goto out_up_write;
> > > +       }
> > >         /*
> > >          * spin_lock() below is not the equivalent of smp_wmb(), but
> > >          * the smp_wmb() inside __SetPageUptodate() can be reused to
> > > @@ -1364,9 +1424,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > >         pte_unmap_unlock(pte, ptl);
> > >         if (ret) {
> > >                 node = khugepaged_find_target_node();
> > > -               /* collapse_huge_page will return with the mmap_lock released */
> > > -               collapse_huge_page(mm, address, hpage, node,
> > > -                               referenced, unmapped);
> > > +               /*
> > > +                * collapse_huge_page will return with the mmap_r+w_lock released.
> > > +                * It is uncertain if *hpage is NULL or not when collapse_huge_page
> > > +                * returns, so keep ret=1 to jump to breakouterloop_mmap_lock
> > > +                * in khugepaged_scan_mm_slot, then *hpage will be freed
> > > +                * if collapse failed.
> > It may be not true for !NUMA case, the huge page may be reused, so you
> > need to make sure the huge page is cleared before return, I had a
> > patch that remove the special case for !NUMA
> > (https://lore.kernel.org/linux-mm/20210817202146.3218-1-shy828301@xxxxxxxxx/),
> > it could avoid clearing huge page, I should resurrect it.
>
> Thanks for catching this! Clearing the hugepage before returning is
> for security reasons
> or to avoid memory corruption, right?

Yes, but it should be fine since the users typically clear the page
before using it, for example, do_anonymous_page() does call
alloc_zeroed_user_highpage_movable() to allocate a page, which clears
the page before returning. So this should not block you, but remind me
to relook my patch.

> If my understanding is correct, once your patch is merged, this will
> be good. Your patch
> seems to be a very simple one, do you expect it to be merged soon?

It still has some unsolved review comments, and I didn't get time to
work on it yet.

>
> >
> > > +                */
> > > +               collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
> > >         }
> > >  out:
> > >         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > > @@ -2168,6 +2233,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > >                         khugepaged_scan_file(mm, file, pgoff, hpage);
> > >                         fput(file);
> > >                 } else {
> > > +                       /*
> > > +                        * mmap_read_lock is
> > > +                        * 1) released if both scan and collapse succeeded;
> > > +                        * 2) still held if either scan or collapse failed.
> > > +                        */
> > >                         ret = khugepaged_scan_pmd(mm, vma,
> > >                                         khugepaged_scan.address,
> > >                                         hpage);
> > > --
> > > 2.35.1.894.gb6a874cedc-goog
> > >
>
> Thanks for the quick and valuable feedback.
> I will send out the v2 RFC, and please leave/reply there.

Thanks! I may not be able to take a deep look at v2 soon, I will be off
for the following week. I will take a look at it once I'm back.
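
For completeness, since v2 will drop copy_user_highpage_mc() and keep only
copy_highpage_mc(), the surviving helper would presumably look like the one
quoted above minus the unused vaddr/vma parameters. A rough sketch only,
based on the quoted patch rather than the actual v2 code:

/*
 * Sketch only: machine check handled version of copy_highpage().
 * Returns true if copying the page content failed (i.e. poison was
 * consumed during the copy), false otherwise.
 */
static inline bool copy_highpage_mc(struct page *to, struct page *from)
{
        char *vfrom, *vto;
        unsigned long ret;

        vfrom = kmap_local_page(from);
        vto = kmap_local_page(to);
        /* copy_mc_to_kernel() returns the number of bytes not copied */
        ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
        kunmap_local(vto);
        kunmap_local(vfrom);

        return ret > 0;
}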
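
Similarly, for the SCAN_COPY_MC suggestion, the failure path in
collapse_huge_page() could end up reading something like the below. Again,
this is only an illustrative sketch of the idea; the exact enum name and
where it lands in v2 are assumptions:

        copy_succeeded = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
                        vma, address, pte_ptl, &compound_pagelist);
        pte_unmap(pte);
        if (!copy_succeeded) {
                /* SCAN_COPY_MC would be a new entry in the scan_result enum */
                result = SCAN_COPY_MC;
                goto out_up_write;
        }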