On Fri, Jan 20, 2023 at 07:56:15AM -0800, Jiaqi Yan wrote:
> On Thu, Jan 19, 2023 at 7:03 AM <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, Dec 05, 2022 at 03:40:58PM -0800, Jiaqi Yan wrote:
> > > Make __collapse_huge_page_copy return whether copying anonymous pages
> > > succeeded, and make collapse_huge_page handle the return status.
> > >
> > > Break the existing PTE scan loop into two for-loops. The first loop
> > > copies source pages into the target huge page, and can fail gracefully
> > > when running into memory errors in the source pages. If copying all
> > > pages succeeds, the second loop releases and clears up these normal
> > > pages. Otherwise, the second loop rolls back the page table and page
> > > states by:
> > > - re-establishing the original PTEs-to-PMD connection.
> > > - releasing the source pages back to their LRU list.
> > >
> > > Tested manually:
> > > 0. Enable khugepaged on the system under test.
> > > 1. Start a two-thread application. Each thread allocates a chunk of
> > >    non-huge anonymous memory buffer.
> > > 2. Pick 4 random buffer locations (2 in each thread) and inject
> > >    uncorrectable memory errors at the corresponding physical addresses.
> > > 3. Signal both threads to make their memory buffers collapsible, i.e.
> > >    by calling madvise(MADV_HUGEPAGE).
> > > 4. Wait and check the kernel log: khugepaged is able to recover from
> > >    poisoned pages and skips collapsing them.
> > > 5. Signal both threads to inspect their buffer contents and make sure
> > >    there is no data corruption.
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
> > > ---
> > >  include/trace/events/huge_memory.h |   3 +-
> > >  mm/khugepaged.c                    | 179 ++++++++++++++++++++++-------
> > >  2 files changed, 139 insertions(+), 43 deletions(-)
> > >
> > > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > > index 35d759d3b0104..5743ae970af31 100644
> > > --- a/include/trace/events/huge_memory.h
> > > +++ b/include/trace/events/huge_memory.h
> > > @@ -36,7 +36,8 @@
> > >  	EM( SCAN_ALLOC_HUGE_PAGE_FAIL,	"alloc_huge_page_failed")	\
> > >  	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
> > >  	EM( SCAN_TRUNCATED,		"truncated")			\
> > > -	EMe(SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> > > +	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> > > +	EMe(SCAN_COPY_MC,		"copy_poisoned_page")		\
> > >
> > >  #undef EM
> > >  #undef EMe
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 5a7d2d5093f9c..0f1b9e05e17ec 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -19,6 +19,7 @@
> > >  #include <linux/page_table_check.h>
> > >  #include <linux/swapops.h>
> > >  #include <linux/shmem_fs.h>
> > > +#include <linux/kmsan.h>
> > >
> > >  #include <asm/tlb.h>
> > >  #include <asm/pgalloc.h>
> > > @@ -55,6 +56,7 @@ enum scan_result {
> > >  	SCAN_CGROUP_CHARGE_FAIL,
> > >  	SCAN_TRUNCATED,
> > >  	SCAN_PAGE_HAS_PRIVATE,
> > > +	SCAN_COPY_MC,
> > >  };
> > >
> > >  #define CREATE_TRACE_POINTS
> > > @@ -530,6 +532,27 @@ static bool is_refcount_suitable(struct page *page)
> > >  	return page_count(page) == expected_refcount;
> > >  }
> > >
> > > +/*
> > > + * Copies memory with #MC in the source page (@from) handled. Returns
> > > + * the number of bytes not copied if there was an exception; otherwise
> > > + * 0 for success. Note that handling #MC requires arch opt-in.
> > > + */
> > > +static int copy_mc_page(struct page *to, struct page *from)
> > > +{
> > > +	char *vfrom, *vto;
> > > +	unsigned long ret;
> > > +
> > > +	vfrom = kmap_local_page(from);
> > > +	vto = kmap_local_page(to);
> > > +	ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> > > +	if (ret == 0)
> > > +		kmsan_copy_page_meta(to, from);
> > > +	kunmap_local(vto);
> > > +	kunmap_local(vfrom);
> > > +
> > > +	return ret;
> > > +}
> >
> > It is very similar to copy_mc_user_highpage(), but uses
> > kmsan_copy_page_meta() instead of kmsan_unpoison_memory().
> >
> > Could you explain the difference? I don't quite get it.
>
> copy_mc_page is actually the MC version of copy_highpage, which uses
> kmsan_copy_page_meta instead of kmsan_unpoison_memory.
>
> My understanding is that kmsan_copy_page_meta covers kmsan_unpoison_memory.
> When there is no metadata (kmsan_shadow or kmsan_origin), both
> kmsan_copy_page_meta and kmsan_unpoison_memory just do
> kmsan_internal_unpoison_memory to mark the memory range as initialized;
> when there is metadata in the src page, kmsan_copy_page_meta copies
> whatever metadata src has to dst. So I think kmsan_copy_page_meta is the
> right thing to do.

Should we fix copy_mc_user_highpage() then?
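
Untested sketch of what such a fix could look like: the current
copy_mc_user_highpage() body from include/linux/highmem.h, with only the
KMSAN call swapped out for the metadata-copying variant the patch above
already uses:

	static inline int copy_mc_user_highpage(struct page *to, struct page *from,
						unsigned long vaddr, struct vm_area_struct *vma)
	{
		unsigned long ret;
		char *vfrom, *vto;

		vfrom = kmap_local_page(from);
		vto = kmap_local_page(to);
		ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
		if (!ret)
			/* was: kmsan_unpoison_memory(page_address(to), PAGE_SIZE); */
			kmsan_copy_page_meta(to, from);
		kunmap_local(vto);
		kunmap_local(vfrom);

		return ret;
	}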
> > Indentation levels get out of control. Maybe some code restructuring is
> > required?
>
> v10 will change to something like this to reduce one level of indentation:
>
>         if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval)))
>                 continue;
>         src_page = pte_page(pteval);
>         if (!PageCompound(src_page))
>                 release_pte_page(src_page);

I hoped for a deeper rework. Maybe split the function into several
functions and make the overall structure more readable?
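
For example, something along these lines (a very rough, untested sketch;
the helper names are made up and their bodies are omitted):

	/* Release/clear the source pages once all copies have succeeded. */
	static void __collapse_huge_page_copy_succeeded(pte_t *pte,
			struct vm_area_struct *vma, unsigned long address,
			spinlock_t *ptl, struct list_head *compound_pagelist);

	/* Roll back PTEs and put the source pages back on their LRU list. */
	static void __collapse_huge_page_copy_failed(pte_t *pte, pmd_t *pmd,
			pmd_t orig_pmd, struct vm_area_struct *vma,
			struct list_head *compound_pagelist);

	static int __collapse_huge_page_copy(pte_t *pte, struct page *page,
			pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
			unsigned long address, spinlock_t *ptl,
			struct list_head *compound_pagelist)
	{
		unsigned long _address;
		pte_t *_pte;
		int result = SCAN_SUCCEED;

		/* Only the copy itself stays in the loop; no cleanup here. */
		for (_pte = pte, _address = address; _pte < pte + HPAGE_PMD_NR;
		     _pte++, page++, _address += PAGE_SIZE) {
			pte_t pteval = *_pte;

			if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
				clear_user_highpage(page, _address);
				continue;
			}
			if (copy_mc_page(page, pte_page(pteval)) > 0) {
				result = SCAN_COPY_MC;
				break;
			}
		}

		if (result == SCAN_SUCCEED)
			__collapse_huge_page_copy_succeeded(pte, vma, address,
					ptl, compound_pagelist);
		else
			__collapse_huge_page_copy_failed(pte, pmd, orig_pmd,
					vma, compound_pagelist);

		return result;
	}

-- 
Kiryl Shutsemau / Kirill A. Shutemov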