On Fri, May 27, 2022 at 8:42 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
>
> On Thu, May 26, 2022 at 5:06 PM Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> >
> > On Tue, May 24, 2022 at 11:41 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> > >
> > > On Mon, May 23, 2022 at 7:54 PM Jiaqi Yan <jiaqiyan@xxxxxxxxxx> wrote:
> > > >
> > > > Make __collapse_huge_page_copy return whether
> > > > collapsing/copying anonymous pages succeeded,
> > > > and make collapse_huge_page handle the return status.
> > > >
> > > > Break existing PTE scan loop into two for-loops.
> > > > The first loop copies source pages into target huge page,
> > > > and can fail gracefully when running into memory errors in
> > > > source pages. Roll back the page table and page states
> > > > when copying failed:
> > > > 1) re-establish the PTEs-to-PMD connection.
> > > > 2) release pages back to their LRU list.
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
> > > > ---
> > > >  include/linux/highmem.h            |  19 ++++
> > > >  include/trace/events/huge_memory.h |  27 +++++-
> > > >  mm/khugepaged.c                    | 146 ++++++++++++++++++++++-------
> > > >  3 files changed, 158 insertions(+), 34 deletions(-)
> > > >
> > > > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > > > index 39bb9b47fa9cd..0ccb1e92c4b06 100644
> > > > --- a/include/linux/highmem.h
> > > > +++ b/include/linux/highmem.h
> > > > @@ -298,6 +298,25 @@ static inline void copy_highpage(struct page *to, struct page *from)
> > > >
> > > >  #endif
> > > >
> > > > +/*
> > > > + * Machine check exception handled version of copy_highpage.
> > > > + * Return true if copying page content failed; otherwise false.
> > > > + * Note handling #MC requires arch opt-in.
> > > > + */
> > > > +static inline bool copy_highpage_mc(struct page *to, struct page *from)
> > > > +{
> > > > +        char *vfrom, *vto;
> > > > +        unsigned long ret;
> > > > +
> > > > +        vfrom = kmap_local_page(from);
> > > > +        vto = kmap_local_page(to);
> > > > +        ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > > +        kunmap_local(vto);
> > > > +        kunmap_local(vfrom);
> > > > +
> > > > +        return ret > 0;
> > > > +}
> > > > +
> > > >  static inline void memcpy_page(struct page *dst_page, size_t dst_off,
> > > >                                 struct page *src_page, size_t src_off,
> > > >                                 size_t len)
> > > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > > index 4fdb14a81108b..7e02c0aa2f728 100644
> > > > --- a/include/trace/events/huge_memory.h
> > > > +++ b/include/trace/events/huge_memory.h
> > > > @@ -34,7 +34,8 @@
> > > >          EM( SCAN_ALLOC_HUGE_PAGE_FAIL, "alloc_huge_page_failed") \
> > > >          EM( SCAN_CGROUP_CHARGE_FAIL,   "ccgroup_charge_failed")  \
> > > >          EM( SCAN_TRUNCATED,            "truncated")              \
> > > > -        EMe(SCAN_PAGE_HAS_PRIVATE,     "page_has_private")       \
> > > > +        EM( SCAN_PAGE_HAS_PRIVATE,     "page_has_private")       \
> > > > +        EMe(SCAN_COPY_MC,              "copy_poisoned_page")     \
> > > >
> > > >  #undef EM
> > > >  #undef EMe
> > > > @@ -167,5 +168,29 @@ TRACE_EVENT(mm_collapse_huge_page_swapin,
> > > >                  __entry->ret)
> > > >  );
> > > >
> > > > +TRACE_EVENT(mm_collapse_huge_page_copy,
> > >
> > > You may misunderstand my point,
> > > IMHO trace_mm_collapse_huge_page() is good enough. When copy is failed
> > > you do have "result = SCAN_COPY_MC" and the result will be consumed by
> > > trace_mm_collapse_huge_page(). You just need have trace interpret the
> > > new result as the above section does.
> >
> > I see.
> > To make sure we are on the same page, I think we only need to add:
> >
> > EMe(SCAN_COPY_MC, "copy_poisoned_page")
>
> Yes
>
> >
> > I don't think we need to add a 'copied' arg in mm_collapse_huge_page() as
> > "copy_poisoned_page" is the only reason copied=false.
> >
> > My intention to add a new tracepoint is for showing the poisoned PFN but I feel
> > you don't think this is very important information w.r.t khugepaged.
> > The important thing is we skipped the poisoned page(s).
>
> I think memory failure will report the detailed information later.
> Khugepaged just needs to know the reason IMHO.
>
Eventually, yes. For example, if userspace later accesses the poisoned memory,
memory_failure() will be invoked from the MCE handling code path. A future
work item could be to invoke memory_failure() right away in the khugepaged
context, but for now I don't have a concrete scenario demonstrating that doing
so is better than waiting for memory_failure() to happen later. Anyway, this
is not intended to be included in this patch series. I will send out v4 asap.
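
To illustrate what that future work item might look like (only a rough sketch,
not something planned for v4): once copy_highpage_mc() reports a failed copy,
khugepaged could hand the poisoned source PFN to the memory-failure machinery
itself, e.g. via memory_failure_queue(), which defers the actual handling to a
workqueue so none of the heavy lifting runs while khugepaged still holds the
isolated and locked pages:

        if (copy_highpage_mc(page, src_page)) {
                copy_succeeded = false;
                /*
                 * Hypothetical follow-up, not part of this series:
                 * queue the poisoned source page for memory_failure()
                 * handling right away instead of waiting for the next
                 * consumer to hit it. Flags are 0 because this is a
                 * kernel-initiated report, not an action-required
                 * fault taken by userspace.
                 */
                memory_failure_queue(page_to_pfn(src_page), 0);
                break;
        }

Whether that is really better than letting memory_failure() run later from the
MCE path is exactly the open question above, so it stays out of this series.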
> > > >
> > > > +
> > > > +        TP_PROTO(struct page *start, struct page *poisoned, int status),
> > > > +
> > > > +        TP_ARGS(start, poisoned, status),
> > > > +
> > > > +        TP_STRUCT__entry(
> > > > +                __field(unsigned long, start_pfn)
> > > > +                __field(unsigned long, poisoned_pfn)
> > > > +                __field(int, status)
> > > > +        ),
> > > > +
> > > > +        TP_fast_assign(
> > > > +                __entry->start_pfn = start ? page_to_pfn(start) : -1;
> > > > +                __entry->poisoned_pfn = poisoned ? page_to_pfn(poisoned) : -1;
> > > > +                __entry->status = status;
> > > > +        ),
> > > > +
> > > > +        TP_printk("start_pfn=0x%lx, poisoned_pfn=0x%lx, status=%s",
> > > > +                __entry->start_pfn,
> > > > +                __entry->poisoned_pfn,
> > > > +                __print_symbolic(__entry->status, SCAN_STATUS))
> > > > +);
> > > > +
> > > >  #endif /* __HUGE_MEMORY_H */
> > > >  #include <trace/define_trace.h>
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 131492fd1148b..1b08e31ba081a 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -52,6 +52,7 @@ enum scan_result {
> > > >          SCAN_CGROUP_CHARGE_FAIL,
> > > >          SCAN_TRUNCATED,
> > > >          SCAN_PAGE_HAS_PRIVATE,
> > > > +        SCAN_COPY_MC,
> > > >  };
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > > @@ -739,44 +740,104 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > >          return 0;
> > > >  }
> > > >
> > > > -static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > > -                                      struct vm_area_struct *vma,
> > > > -                                      unsigned long address,
> > > > -                                      spinlock_t *ptl,
> > > > -                                      struct list_head *compound_pagelist)
> > > > +/*
> > > > + * __collapse_huge_page_copy - attempts to copy memory contents from normal
> > > > + * pages to a hugepage. Cleanup the normal pages if copying succeeds;
> > > > + * otherwise restore the original page table and release isolated normal pages.
> > > > + * Returns true if copying succeeds, otherwise returns false.
> > > > + *
> > > > + * @pte: starting of the PTEs to copy from
> > > > + * @page: the new hugepage to copy contents to
> > > > + * @pmd: pointer to the new hugepage's PMD
> > > > + * @rollback: the orginal normal pages' PMD
> > >
> > > s/orginal/original
> > >
> > > > + * @address: starting address to copy
> > > > + * @pte_ptl: lock on normal pages' PTEs
> > > > + * @compound_pagelist: list that stores compound pages
> > > > + */
> > > > +static bool __collapse_huge_page_copy(pte_t *pte,
> > > > +                                      struct page *page,
> > > > +                                      pmd_t *pmd,
> > > > +                                      pmd_t rollback,
> > > > +                                      struct vm_area_struct *vma,
> > > > +                                      unsigned long address,
> > > > +                                      spinlock_t *pte_ptl,
> > > > +                                      struct list_head *compound_pagelist)
> > > >  {
> > > >          struct page *src_page, *tmp;
> > > >          pte_t *_pte;
> > > > -        for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> > > > -             _pte++, page++, address += PAGE_SIZE) {
> > > > -                pte_t pteval = *_pte;
> > > > +        pte_t pteval;
> > > > +        unsigned long _address;
> > > > +        spinlock_t *pmd_ptl;
> > > > +        bool copy_succeeded = true;
> > > >
> > > > -                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > > +        /*
> > > > +         * Copying pages' contents is subject to memory poison at any iteration.
> > > > +         */
> > > > +        for (_pte = pte, _address = address;
> > > > +             _pte < pte + HPAGE_PMD_NR;
> > > > +             _pte++, page++, _address += PAGE_SIZE) {
> > > > +                pteval = *_pte;
> > > > +
> > > > +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
> > > >                          clear_user_highpage(page, address);
> > > > -                        add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > > -                        if (is_zero_pfn(pte_pfn(pteval))) {
> > > > -                                /*
> > > > -                                 * ptl mostly unnecessary.
> > > > -                                 */
> > > > -                                spin_lock(ptl);
> > > > -                                ptep_clear(vma->vm_mm, address, _pte);
> > > > -                                spin_unlock(ptl);
> > > > +                else {
> > > > +                        src_page = pte_page(pteval);
> > > > +                        if (copy_highpage_mc(page, src_page)) {
> > > > +                                copy_succeeded = false;
> > > > +                                trace_mm_collapse_huge_page_copy(pte_page(*pte),
> > > > +                                                src_page, SCAN_COPY_MC);
> > >
> > > As aforementioned I don't think we need a new trace point.
> > >
> > > > +                                break;
> > > > +                        }
> > > > +                }
> > > > +        }
> > > > +
> > > > +        if (copy_succeeded)
> > > > +                trace_mm_collapse_huge_page_copy(pte_page(*pte),
> > > > +                                NULL, SCAN_SUCCEED);
> > > > +        else {
> > > > +                /*
> > > > +                 * Copying failed, re-establish the regular PMD that points to
> > > > +                 * the regular page table. Restoring PMD needs to be done prior
> > > > +                 * to releasing pages. Since pages are still isolated and locked
> > > > +                 * here, acquiring anon_vma_lock_write is unnecessary.
> > > > +                 */
> > > > +                pmd_ptl = pmd_lock(vma->vm_mm, pmd);
> > > > +                pmd_populate(vma->vm_mm, pmd, pmd_pgtable(rollback));
> > > > +                spin_unlock(pmd_ptl);
> > > > +        }
> > > > +
> > > > +        for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
> > > > +             _pte++, _address += PAGE_SIZE) {
> > > > +                pteval = *_pte;
> > > > +                if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > > +                        if (copy_succeeded) {
> > > > +                                add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > > +                                if (is_zero_pfn(pte_pfn(pteval))) {
> > > > +                                        /*
> > > > +                                         * ptl mostly unnecessary.
> > > > +                                         */
> > > > +                                        spin_lock(pte_ptl);
> > > > +                                        pte_clear(vma->vm_mm, _address, _pte);
> > > > +                                        spin_unlock(pte_ptl);
> > > > +                                }
> > > > +                        }
> > > >                  } else {
> > > >                          src_page = pte_page(pteval);
> > > > -                        copy_user_highpage(page, src_page, address, vma);
> > > >                          if (!PageCompound(src_page))
> > > >                                  release_pte_page(src_page);
> > > > -                        /*
> > > > -                         * ptl mostly unnecessary, but preempt has to
> > > > -                         * be disabled to update the per-cpu stats
> > > > -                         * inside page_remove_rmap().
> > > > -                         */
> > > > -                        spin_lock(ptl);
> > > > -                        ptep_clear(vma->vm_mm, address, _pte);
> > > > -                        page_remove_rmap(src_page, false);
> > > > -                        spin_unlock(ptl);
> > > > -                        free_page_and_swap_cache(src_page);
> > > > +
> > > > +                        if (copy_succeeded) {
> > > > +                                /*
> > > > +                                 * ptl mostly unnecessary, but preempt has to
> > > > +                                 * be disabled to update the per-cpu stats
> > > > +                                 * inside page_remove_rmap().
> > > > +                                 */
> > > > +                                spin_lock(pte_ptl);
> > > > +                                pte_clear(vma->vm_mm, _address, _pte);
> > > > +                                page_remove_rmap(src_page, false);
> > > > +                                spin_unlock(pte_ptl);
> > > > +                                free_page_and_swap_cache(src_page);
> > > > +                        }
> > > >                  }
> > > >          }
> > > >
> > > > @@ -784,6 +845,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > >                  list_del(&src_page->lru);
> > > >                  release_pte_page(src_page);
> > > >          }
> > > > +
> > > > +        return copy_succeeded;
> > > >  }
> > > >
> > > >  static void khugepaged_alloc_sleep(void)
> > > > @@ -1066,6 +1129,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > >          struct vm_area_struct *vma;
> > > >          struct mmu_notifier_range range;
> > > >          gfp_t gfp;
> > > > +        bool copied = false;
> > > >
> > > >          VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > > >
> > > > @@ -1177,9 +1241,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> > > >           */
> > > >          anon_vma_unlock_write(vma->anon_vma);
> > > >
> > > > -        __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
> > > > -                        &compound_pagelist);
> > > > +        copied = __collapse_huge_page_copy(pte, new_page, pmd, _pmd,
> > > > +                        vma, address, pte_ptl, &compound_pagelist);
> > > >          pte_unmap(pte);
> > > > +        if (!copied) {
> > > > +                result = SCAN_COPY_MC;
> > > > +                goto out_up_write;
> > > > +        }
> > > >          /*
> > > >           * spin_lock() below is not the equivalent of smp_wmb(), but
> > > >           * the smp_wmb() inside __SetPageUptodate() can be reused to
> > > > @@ -1364,9 +1432,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > > >          pte_unmap_unlock(pte, ptl);
> > > >          if (ret) {
> > > >                  node = khugepaged_find_target_node();
> > > > -                /* collapse_huge_page will return with the mmap_lock released */
> > > > -                collapse_huge_page(mm, address, hpage, node,
> > > > -                                referenced, unmapped);
> > > > +                /*
> > > > +                 * collapse_huge_page will return with the mmap_r+w_lock released.
> > > > +                 * It is uncertain if *hpage is NULL or not when collapse_huge_page
> > >
> > > Why is it uncertain?
> >
> > By "uncertain" I mean *hpage may not be set to NULL near the end of
> > collapse_huge_page if swapin/isolation/copying failed.
> >
> > >
> > > > +                 * returns, so keep ret=1 to jump to breakouterloop_mmap_lock
> > >
> > > "ret=1" means scan succeeded so mmap_lock is released. But this
> > > happens in the caller.
> > >
> > > > +                 * in khugepaged_scan_mm_slot, then *hpage will be freed
> > > > +                 * if collapse failed.
> > >
> > > It depends. For CONFIG_NUMA, it will be freed since next scan may need
> > > a page from a different node, but !CONFIG_NUMA may not free it, it may
> > > be reused IIUC.
> > > We do intend to remove this optimization, but it has not happened yet.
> >
> > Yes, you are right if !CONFIG_NUMA, khugepaged_prealloc_page only
> > free *hpage when page_count(*hpage) > 1.
> > >
> > > TBH I don't think the comment is helpful here and actually confusing,
> > > I'd prefer to have it right before calling khugepaged_scan_pmd() if
> > > you would like to keep it.
> >
> > I will remove the comment and revert this part.
> >
> > >
> > > > +                 */
> > > > +                collapse_huge_page(mm, address, hpage, node, referenced, unmapped);
> > > >          }
> > > >  out:
> > > >          trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> > > > @@ -2168,6 +2241,13 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> > > >                          khugepaged_scan_file(mm, file, pgoff, hpage);
> > > >                          fput(file);
> > > >                  } else {
> > > > +                        /*
> > > > +                         * mmap_read_lock is
> > > > +                         * 1) still held if scan failed;
> > > > +                         * 2) released if scan succeeded.
> > > > +                         * It is not affected by collapsing or copying
> > > > +                         * operations.
> > > > +                         */
> > > >                          ret = khugepaged_scan_pmd(mm, vma,
> > > >                                          khugepaged_scan.address,
> > > >                                          hpage);
> > > > --
> > > > 2.36.1.124.g0e6072fb45-goog
> > > >