On Mon, Jan 20, 2025 at 09:18:49PM -0700, Jane Chu wrote:
> When a process consumes a UE in a page, the memory failure handler
> attempts to collect information for a potential SIGBUS.
> If the page is an anonymous page, page_mapped_in_vma(page, vma) is
> invoked in order to
>   1. retrieve the vaddr from the process' address space,
>   2. verify that the vaddr is indeed mapped to the poisoned page,
> where 'page' is the precise small page with UE.
>
> It's been observed that when injecting poison to a non-head subpage
> of an anonymous hugetlb page, no SIGBUS shows up; while injecting to
> the head page produces a SIGBUS. The cause is that, though hugetlb_walk()
> returns a valid pmd entry (on x86), check_pte() detects a mismatch
> between the head page per the pmd and the input subpage. Thus the vaddr
> is considered not mapped to the subpage and the process is not collected
> for SIGBUS purposes. This is the calling stack:
>   collect_procs_anon
>     page_mapped_in_vma
>       page_vma_mapped_walk
>         hugetlb_walk
>         huge_pte_lock
>         check_pte
>
> It seems that the most obvious place to fix the issue is by making
> page_mapped_in_vma() hugetlb walk aware. The precise subpage in the
> input is useful in providing PAGE_SIZE granularity vaddr.

I don't like this solution because it adds yet another special case for
hugetlb.  If we don't split a PMD-mapped THP, we'd have the same problem,
right?  check_pte() would succeed if we set pvmw->pfn to folio_pfn() and
pvmw->nr_pages to folio_nr_pages(), right?  I just don't know what else
might be affected by that.

I like one of these two options:

@@ -206,6 +206,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		pvmw->pte = hugetlb_walk(vma, pvmw->address, size);
 		if (!pvmw->pte)
 			return false;
+		pvmw->pte += pvmw->address & (size - PAGE_SIZE);
 
 		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
 		if (!check_pte(pvmw))

(that needs a bit of tidying up; you can't just do that, but I think you
get the basic idea -- correct the pte to point to the precise page
instead of the hugetlb pfn)

The option I really prefer is much more work but matches our preferred
direction of getting rid of hugetlb specific code.  Something like this:

@@ -192,27 +192,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
 
-	if (unlikely(is_vm_hugetlb_page(vma))) {
-		struct hstate *hstate = hstate_vma(vma);
-		unsigned long size = huge_page_size(hstate);
-		/* The only possible mapping was handled on last iteration */
[...]
-		pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
-		if (!check_pte(pvmw))
-			return not_found(pvmw);
-		return true;
-	}
-
 	end = vma_address_end(pvmw);
 	if (pvmw->pte)
 		goto next_pte;
@@ -229,7 +208,19 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw
 			continue;
 		}
 		pud = pud_offset(p4d, pvmw->address);
-		if (!pud_present(*pud)) {
+		pude = *pud;
+		if (pud_trans_huge(pude) ||
+		    (pud_present(pude) && pud_devmap(pude))) {
+			pvmw->ptl = pud_lock(mm, pvmw->pud);
+			...
+			if (likely(pud_trans_huge(pude) || pud_devmap(pude))) {
+				if (pvmw->flags & PVMW_MIGRATION)
+					return not_found(pvmw);
+				if (!check_pud(pud_pfn(pude), pvmw))
+					return not_found(pvmw);
+				return true;
+			}
+		} else if (!pud_present(pude)) {
 			step_forward(pvmw, PUD_SIZE);
 			continue;
 		}

ie get rid of all the hugetlb-specific code, and add support for the PUD
level to the common code.  You'd also need to write check_pud() (rough
sketch at the end of this mail).  I'll understand if you don't want to
do all the extra work.  And thanks for tracking down this bug.
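
P.S. For what it's worth, check_pmd() is just an overlap test between the
pfn range mapped by the huge entry and pvmw->pfn .. pvmw->pfn +
pvmw->nr_pages - 1, and I'd expect check_pud() to be pretty much a copy
of it with a larger range.  A rough, untested sketch -- this assumes
HPAGE_PUD_NR is usable on the configs that can reach the PUD path, and
it's only meant to show the shape of the thing:

/*
 * Like check_pmd(): does pfn .. pfn + HPAGE_PUD_NR - 1 overlap the
 * pfn range the walk is looking for?  Careful to not overflow.
 */
static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
{
	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
		return false;
	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
		return false;
	return true;
}

The caller in the sketch above would pass pud_pfn(pude), mirroring how
check_pmd() gets pmd_pfn(pmde) today.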