Goal
====
Currently, once a byte in a HugeTLB hugepage becomes HWPOISON, the whole
hugepage is unmapped from the page table, because the hugepage is the
finest granularity of the mapping. High granularity mapping (HGM) [1],
the functionality to map memory addresses at finer granularities (in the
extreme case, PAGE_SIZE), was recently proposed upstream. It provides an
opportunity to handle memory errors more efficiently: instead of
unmapping the whole hugepage, only the raw subpage containing the error
needs to be thrown away, and all the healthy subpages can stay available
to users.

Idea
====
Today memory failure recovery for HugeTLB pages (hugepages) is different
from that for raw and THP pages. We are only interested in in-use
hugepages, which are dealt with in these simplified steps:
1. Increment the refcount on the compound head of the hugepage.
2. Insert the raw HWPOISON page into the compound head’s raw_hwp_list
   (_hugetlb_hwpoison) if it is not already in the list.
3. Unmap the entire hugepage from HugeTLB’s page table.
4. Kill the processes that are accessing the poisoned hugepage.

HGM can greatly improve this recovery mechanism. Step #3 (unmapping the
entire hugepage) can be replaced by
3.1 Map the entire hugepage at finer granularity, so that the exact
    HWPOISON address is mapped by a PAGE_SIZE PTE, and the rest of the
    address space is optimally mapped by either smaller P*Ds or PTEs.
    In other words, the original HugeTLB PTE is split into smaller P*Ds
    and PTEs.
3.2 Only unmap the newly mapped PTE that maps the HWPOISON address.

For shared mappings, the current HGM patches are already a solid basis
for the splitting functionality in step #3.1. This RFC drafts a complete
solution for shared mappings. The splitting-based idea can be applied to
private mappings as well, but additional subtle complexity needs to be
dealt with. We defer the private mapping case as future work.
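
To make the layout produced by step 3.1 concrete, here is a minimal
userspace model of the split. It is only a sketch, not code from this
series: the names model_split, optimal_size, and struct split_stats are
hypothetical, and the three mapping sizes (1G, 2M, 4K) assume x86-64.
At each address it picks the largest aligned size that fits and does
not cover the poisoned address, falling back to PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed x86-64 mapping sizes, largest first: PUD (1G), PMD (2M), PTE (4K). */
enum { NR_LEVELS = 3 };
static const uint64_t level_size[NR_LEVELS] = {
	1ULL << 30, 1ULL << 21, 1ULL << 12
};
#define MODEL_PAGE_SIZE (1ULL << 12)

/* Hypothetical result type: how many mappings were created at each level. */
struct split_stats {
	uint64_t nr[NR_LEVELS];		/* indexed like level_size[] */
	uint64_t poison_map_size;	/* size of the mapping covering poison */
};

/*
 * Pick the largest size that is aligned at curr, ends at or before end,
 * and does not cover the poisoned address. A PAGE_SIZE mapping is always
 * acceptable, because the poisoned page itself is mapped at that size.
 */
static uint64_t optimal_size(uint64_t curr, uint64_t end, uint64_t poison)
{
	for (size_t i = 0; i < NR_LEVELS; i++) {
		uint64_t sz = level_size[i];

		if (curr % sz || curr + sz > end)
			continue;
		if (sz != MODEL_PAGE_SIZE && poison >= curr && poison < curr + sz)
			continue;
		return sz;
	}
	return MODEL_PAGE_SIZE;
}

/* Model splitting one hugepage mapping [start, start + size) around poison. */
struct split_stats model_split(uint64_t start, uint64_t size, uint64_t poison)
{
	struct split_stats st = { { 0 }, 0 };
	uint64_t curr = start, end = start + size;

	/* The model only handles page-aligned ranges. */
	assert(start % MODEL_PAGE_SIZE == 0 && size % MODEL_PAGE_SIZE == 0);

	while (curr < end) {
		uint64_t sz = optimal_size(curr, end, poison);

		for (size_t i = 0; i < NR_LEVELS; i++)
			if (level_size[i] == sz)
				st.nr[i]++;
		if (poison >= curr && poison < curr + sz)
			st.poison_map_size = sz;
		curr += sz;
	}
	return st;
}
```

Under these assumptions, a 1G hugepage with a single poisoned 4K page
ends up as 511 PMD-sized mappings plus 512 PAGE_SIZE mappings, and only
the one PAGE_SIZE mapping covering the poisoned address would be
unmapped in step 3.2.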

Splitting HugeTLB PTEs (Step #3.1)
==================================
The general process of splitting a present leaf HugeTLB PTE is
1. Get and clear the original HugeTLB PTE old_pte.
2. Initialize curr with the start of the address range corresponding to
   old_pte.
3. Find the optimal level we should map curr at.
4. Perform an HGM walk on curr with the optimal level found in step 3,
   potentially allocating a new PTE at that level.
5. Populate the newly allocated PTE with bits from old_pte, including
   dirty, write, and UFFD_WP.
6. Update curr += the newly created PTE size, and repeat from step 3
   until the entire VMA is covered.

Splitting a hugepage mapping is mostly not meaningful for none PTEs. We
handle none or userfaultfd write protect (UFFD_WP) marker HugeTLB PTEs
at the time of page faulting. Migration and HWPOISON PTEs are better
left untouched.

Memory Failure Recovery and Unmapping (Step #3.2)
=================================================
A few changes are made in memory_failure and rmap to only unmap raw
HWPOISON pages:
1. As long as HGM is turned on in CONFIG, memory_failure attempts to
   enable HGM on the VMA containing the poisoned hugepage.
2. memory_failure attempts to split the HugeTLB PTE so that the poisoned
   address is mapped by a PAGE_SIZE PTE, for all the VMAs containing the
   poisoned hugepage.
3. get_huge_page_for_hwpoison only returns -EHWPOISON if the raw page is
   already in the compound head’s raw_hwp_list. This makes unmapping
   work correctly when multiple raw pages in the same hugepage become
   HWPOISON.
4. rmap utilizes the compound head’s raw_hwp_list to 1) avoid unmapping
   raw pages not in the list, and 2) keep track of whether the raw pages
   in the list are already unmapped.
5. The page refcount check in me_huge_page is skipped.

Between mmap() and Page Fault
=============================
A memory error can occur between the time when userspace maps a hugepage
and the time when userspace faults in the mapped hugepage.
The general idea is to not create any raw-page-size page table entry for
HWPOISON memory, and to keep memory in healthy raw pages available to
userspace (via normal fault handling). At the time of hugetlb_no_page:
- If the entire hugepage doesn’t contain any HWPOISON page, the normal
  page fault handler continues.
- If the memory address being faulted is within a HWPOISON raw page,
  hugetlb_no_page returns VM_FAULT_HWPOISON_LARGE (so that the page
  fault handler sends a BUS_MCEERR_AR SIGBUS to the faulting process).
- If the memory address being faulted is within a healthy raw page,
  hugetlb_no_page utilizes HGM to create a new HugeTLB PTE whose
  hugetlb_pte_size is the largest possible that doesn’t map any
  HWPOISON address. Then the normal page fault handler continues.

Failure Handling
================
- If the kernel still fails to allocate a new raw_hwp_page after a
  retry, memory_failure returns MF_IGNORED with MF_MSG_UNKNOWN.
- For each VMA that maps the HWPOISON hugepage
  - If the VMA is not eligible for HGM, the old behavior is taken: unmap
    the entire hugepage from that VMA.
  - If memory_failure fails to enable HGM on the VMA, or if
    memory_failure fails to split any VMA that mapped the HWPOISON page,
    the recovery returns MF_IGNORED with MF_MSG_UNMAP_FAILED.
- For a particular VMA, if splitting the HugeTLB PTE fails, the original
  PTE will be restored to the page table.

Code Changes
============
The code patches in this RFC are based on the HGM patchset V2 [1] and
are composed of two parts. The first part implements the idea laid out
in the cover letter; the second part tests two major scenarios: HWPOISON
on already faulted pages, and HWPOISON occurring between mapping and
faulting.

Future Changes
==============
There is a pending improvement to hugetlbfs_read_iter. If a hugepage is
found in the page cache and it contains HWPOISON subpages, today the
kernel returns -EIO immediately.
With the new splitting-then-unmap behavior, the kernel can return to
userspace every byte up to the first raw HWPOISON byte. If userspace
wants the read to start within a raw HWPOISON page, the kernel will have
to return -EIO. This improvement and its selftest will be done in a
future patch series.

[1] https://lore.kernel.org/all/20230218002819.1486479-1-jthoughton@xxxxxxxxxx/

Jiaqi Yan (7):
  hugetlb: add HugeTLB splitting functionality
  hugetlb: create PTE level mapping when possible
  mm: publish raw_hwp_page in mm.h
  mm/memory_failure: unmap raw HWPoison PTEs when possible
  hugetlb: only VM_FAULT_HWPOISON_LARGE raw page
  selftest/mm: test PAGESIZE unmapping HWPOISON pages
  selftest/mm: test PAGESIZE unmapping UFFD WP marker HWPOISON pages

 include/linux/hugetlb.h                  |  14 +
 include/linux/mm.h                       |  36 ++
 mm/hugetlb.c                             | 405 ++++++++++++++++++++++-
 mm/memory-failure.c                      | 206 ++++++++++--
 mm/rmap.c                                |  38 ++-
 tools/testing/selftests/mm/hugetlb-hgm.c | 364 ++++++++++++++++++--
 6 files changed, 1004 insertions(+), 59 deletions(-)

-- 
2.40.1.495.gc816e09b53d-goog