On Sun, Apr 07, 2024 at 12:08:30AM +0000, Luck, Tony wrote:
> Oscar.
>
> Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.

Thanks for the swift test Tony!

> Here's what's happening.
>
> When the machine check occurs there's a scramble from various subsystems
> to report the memory error.
>
> ghes_do_memory_failure() calls memory_failure_queue() which later
> calls memory_failure() from a kernel thread. Side note: this happens TWICE
> for each error. Not sure yet if this is a BIOS issue logging more than once.
> or some Linux issues in acpi/apei/ghes.c code.
>
> uc_decode_notifier() [called from a different kernel thread] also calls
> do_memory_failure()
>
> Finally kill_me_maybe() [called from task_work on return to the application
> when returning from the machine check handler] also calls memory_failure()
>
> do_memory_failure() is somewhat prepared for multiple reports of the same
> error. It uses an atomic test and set operation to mark the page as poisoned.
>
> First called to report the error does all the real work. Late arrivals take a
> shorter path, but may still take some action(s) depending on the "flags"
> passed in:
>
>         if (TestSetPageHWPoison(p)) {
>                 pr_err("%#lx: already hardware poisoned\n", pfn);
>                 res = -EHWPOISON;
>                 if (flags & MF_ACTION_REQUIRED)
>                         res = kill_accessing_process(current, pfn, flags);
>                 if (flags & MF_COUNT_INCREASED)
>                         put_page(p);
>                 goto unlock_mutex;
>         }

Thanks for the detailed explanation.

> In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
> kill_accessing_process() ... which is in the stack trace that led to the:
>
> kernel BUG at include/linux/swapops.h:88!
>
> I'm not sure that I fully understand your patch. I guess that it is making sure to
> handle the case that the page has already been marked as poisoned?
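
Before getting to the patch itself, the "first reporter does the work, late
arrivals take a shorter path" behaviour quoted above can be modelled in user
space. This is only a rough sketch: every model_* / MODEL_* name is
hypothetical, the page flag is a C11 atomic_bool rather than the kernel's
page flags, and the MF_COUNT_INCREASED / put_page() handling is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MF_ACTION_REQUIRED; not the kernel's value. */
#define MODEL_MF_ACTION_REQUIRED 0x1

enum model_result {
	MODEL_HANDLED,       /* first reporter: does the full handling */
	MODEL_ALREADY,       /* late arrival, nothing left to do (-EHWPOISON) */
	MODEL_KILL_ACCESSOR, /* late arrival that must still act on current */
};

struct model_page {
	atomic_bool hwpoison; /* models the HWPoison page flag */
};

/* Atomically set the flag and return its previous value. */
static bool model_test_set_hwpoison(struct model_page *p)
{
	return atomic_exchange(&p->hwpoison, true);
}

/* Models only the early "already poisoned" branch quoted above. */
static enum model_result model_memory_failure(struct model_page *p, int flags)
{
	assert(p != NULL);

	if (model_test_set_hwpoison(p)) {
		/* Somebody else already marked the page. */
		if (flags & MODEL_MF_ACTION_REQUIRED)
			return MODEL_KILL_ACCESSOR;
		return MODEL_ALREADY;
	}
	return MODEL_HANDLED;
}
```

Calling model_memory_failure() twice on the same page gives MODEL_HANDLED
for the first call; a later call with MODEL_MF_ACTION_REQUIRED set ends up
in the MODEL_KILL_ACCESSOR branch, which corresponds to the
kill_accessing_process() call in the stack trace.
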
Basically what is happening is:

1) We mark the page as HWPoison
2) We see that the page is mapped by someone
3) We try to unmap it, and in the process we create a hwpoison swap entry.
   See the following chunk from try_to_unmap_one():

   "
    if (PageHWPoison(subpage) && (flags & TTU_HWPOISON)) {
            pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
            if (folio_test_hugetlb(folio)) {
                    hugetlb_count_sub(folio_nr_pages(folio), mm);
                    set_huge_pte_at(mm, address, pvmw.pte, pteval, hsz);
            } else {
                    dec_mm_counter(mm, mm_counter(folio));
                    set_pte_at(mm, address, pvmw.pte, pteval);
            }
            ...
    }
   "

4) Now there is a second memory event (maybe the previous one has
   already finished, I do not think it matters for the sake of this
   problem)
5) The second event sees that the page has already been marked as
   HWPoison but since it has MF_ACTION_REQUIRED specified, it
   goes to kill_accessing_process() to do what its name says.
6) We walk the page tables of the accessing process to see if it has
   the poisoned pfn.
7) check_hwpoisoned_entry()
   (which is called from
    walk_page_range()->walk_{pgd,p4d,pud,pmd}_range()->ops->pmd_entry())
   checks whether any of the ptes is poisoned.
8) Since the previous MCE event unmapped the page, pte_present() == 0,
   so we want to get the swap entry, and this is where it falls off the
   cliff.
   See check_hwpoisoned_entry()

   static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
                                     unsigned long poisoned_pfn, struct to_kill *tk)
   {
           unsigned long pfn = 0;

           if (pte_present(pte)) {
                   pfn = pte_pfn(pte);
           } else {
                   swp_entry_t swp = pte_to_swp_entry(pte);

                   if (is_hwpoison_entry(swp))
                           pfn = swp_offset_pfn(swp);
           }
           ...
   }

   is_hwpoison_entry() returns true (remember the make_hwpoison_entry() call we did?)
   But when we try to get the pfn from the swap entry, we stumble upon the
   VM_BUG_ON(), because is_pfn_swap_entry() only checks for:

   is_migration_entry()
   is_device_private_entry()
   is_device_exclusive_entry()

   but it should also check for is_hwpoison_entry().
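
To make that gap concrete, here is a small user-space model of the check.
It is a sketch, not kernel code: the model_* names and the enum encoding are
made up for illustration, the "current" variant mirrors the three checks
listed above, and the "fixed" variant assumes the fix is simply to add the
hwpoison case. Where the kernel would hit the VM_BUG_ON(), the model just
returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry types; the kernel encodes swap entry types differently. */
enum model_swp_type {
	MODEL_SWP_REGULAR,
	MODEL_SWP_MIGRATION,
	MODEL_SWP_DEVICE_PRIVATE,
	MODEL_SWP_DEVICE_EXCLUSIVE,
	MODEL_SWP_HWPOISON,
};

struct model_swp_entry {
	enum model_swp_type type;
	unsigned long offset; /* holds a pfn for pfn-carrying entry types */
};

static bool model_is_hwpoison_entry(struct model_swp_entry e)
{
	return e.type == MODEL_SWP_HWPOISON;
}

/* Mirrors the three checks listed above; the hwpoison case is missing. */
static bool model_is_pfn_swap_entry_current(struct model_swp_entry e)
{
	return e.type == MODEL_SWP_MIGRATION ||
	       e.type == MODEL_SWP_DEVICE_PRIVATE ||
	       e.type == MODEL_SWP_DEVICE_EXCLUSIVE;
}

/* Same check with the hwpoison case added (assumed shape of the fix). */
static bool model_is_pfn_swap_entry_fixed(struct model_swp_entry e)
{
	return model_is_pfn_swap_entry_current(e) || model_is_hwpoison_entry(e);
}

/*
 * Models the pfn extraction from a swap entry. Returns true and stores the
 * pfn when the entry is accepted as pfn-carrying; returns false where the
 * kernel would trip its VM_BUG_ON() under CONFIG_DEBUG_VM.
 */
static bool model_swp_offset_pfn(struct model_swp_entry e,
				 bool (*is_pfn_entry)(struct model_swp_entry),
				 unsigned long *pfn)
{
	assert(pfn != NULL);

	if (!is_pfn_entry(e))
		return false; /* kernel: VM_BUG_ON() fires here */

	*pfn = e.offset;
	return true;
}
```

Feeding a MODEL_SWP_HWPOISON entry through model_swp_offset_pfn() with the
"current" predicate fails, while the "fixed" predicate hands back the stored
pfn, which is the behaviour check_hwpoisoned_entry() relies on.
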
Since is_pfn_swap_entry() does not check for hwpoison entries, it returns
false in our case, which triggers the VM_BUG_ON() in swp_offset_pfn().
Note that this should only matter in environments where CONFIG_DEBUG_VM is set.

I hope this sheds some light on it.

> Anyway ... thanks for the quick fix. I hope the above helps write a good
> commit message to get this applied and backported to stable.
>
> Tested-by: Tony Luck <tony.luck@xxxxxxxxx>

Thanks again Tony, much appreciated.

I will write the patch and most likely send it out either this afternoon
or early tomorrow morning.

-- 
Oscar Salvador
SUSE Labs