On Sun, Apr 07, 2024 at 03:56:17PM +0200, Oscar Salvador wrote:
> On Sun, Apr 07, 2024 at 03:05:37PM +0200, Oscar Salvador wrote:
> > Tony reported that the Machine check recovery was broken in v6.9-rc1,
> > as he was hitting a VM_BUG_ON when injecting uncorrectable memory errors
> > to DRAM.
> > After some more digging and debugging on his side, he realized that this
> > went back to v6.1, with the introduction of 'commit 0d206b5d2e0d ("mm/swap: add
> > swp_offset_pfn() to fetch PFN from swap entry")'.
> > That commit, among other things, introduced swp_offset_pfn(), replacing
> > hwpoison_entry_to_pfn() in its favour.
> >
> > The patch also introduced a VM_BUG_ON() check for is_pfn_swap_entry(),
> > but is_pfn_swap_entry() never got updated to cover hwpoison entries, which
> > means that we would hit the VM_BUG_ON whenever we would call
> > swp_offset_pfn() for such entries on environments with CONFIG_DEBUG_VM set.
> > Fix this by updating the check to cover hwpoison entries as well, and update
> > the comment while we are it.
> >
> > Reported-by: Tony Luck <tony.luck@xxxxxxxxx>
> > Closes: https://lore.kernel.org/all/Zg8kLSl2yAlA3o5D@agluck-desk3/
> > Tested-by: Tony Luck <tony.luck@xxxxxxxxx>
> > Fixes: 0d206b5d2e0d ("mm/swap: add swp_offset_pfn() to fetch PFN from swap entry")

Totally unexpected, as this commit even removed hwpoison_entry_to_pfn().
Obviously even until now I assumed hwpoison is accounted as pfn swap entry
but it's just missing..

Since this commit didn't really change is_pfn_swap_entry() itself, I was
thinking maybe an older fix tag would apply, but then I noticed the old
code indeed should work well even if hwpoison entry is missing.  For
example, it's a grey area on whether a hwpoisoned page should be accounted
in smaps.  So I think the Fixes tag is correct, and thanks for fixing this.

Reviewed-by: Peter Xu <peterx@xxxxxxxxxx>

> > Cc: <stable@xxxxxxxxxxxxxxx> # 6.1.x
>
> I think I need to clarify why the stable.
>
> It is my understanding that some distros ship their kernel with
> CONFIG_DEBUG_VM set by default (I think Fedora comes to my mind?).
> I am fine with backing down if people think that this is an
> overreaction.

Fedora stopped having DEBUG_VM for some time, but not sure about when it's
still in the 6.1 trees.  It looks like cc stable is still reasonable from
that regard.

A side note is that when I'm looking at this, I went back and see why in
some cases we need the pfn maintained for the poisoned, then I saw the only
user is check_hwpoisoned_entry() who wants to do fast kills in some
contexts and that includes a double check on the pfns in a poisoned entry.
Then afaict this path is just too rarely used and buggy.  A few things we
may need fixing, maybe someone in the loop would have time to have a look:

  - check_hwpoisoned_entry()
    - pte_none check is missing
    - all the rest swap types are missing (e.g., we want to kill the proc
      too if the page is during migration)

  - check_hwpoisoned_pmd_entry()
    - need similar care like above (pmd_none is covered not others)

I copied Naoya too.

Thanks,

-- 
Peter Xu
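
For anyone following the thread, the failure mode discussed above can be
modelled in a few lines of userspace C.  This is only a sketch under stated
assumptions: every name below (mock_entry, mock_is_pfn_entry_old, and so
on) is hypothetical and is not the kernel API, the real helper covers more
entry types than shown here, and a "false" return stands in for the
VM_BUG_ON() that fires with CONFIG_DEBUG_VM set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified entry kinds; the kernel encodes these differently. */
enum mock_kind {
    MOCK_SWAP,            /* ordinary swap slot: offset is not a PFN */
    MOCK_MIGRATION,       /* offset carries a PFN */
    MOCK_DEVICE_PRIVATE,  /* offset carries a PFN */
    MOCK_HWPOISON         /* offset carries the poisoned PFN */
};

struct mock_entry {
    enum mock_kind kind;
    unsigned long offset;
};

typedef bool (*mock_pfn_check)(struct mock_entry);

/* Models the check before the fix: hwpoison entries are not covered. */
static bool mock_is_pfn_entry_old(struct mock_entry e)
{
    return e.kind == MOCK_MIGRATION || e.kind == MOCK_DEVICE_PRIVATE;
}

/* Models the check after the fix: hwpoison entries are covered as well. */
static bool mock_is_pfn_entry_fixed(struct mock_entry e)
{
    return mock_is_pfn_entry_old(e) || e.kind == MOCK_HWPOISON;
}

/*
 * Models fetching the PFN behind a sanity check.  Returning false here
 * stands in for the VM_BUG_ON(); on success the PFN is stored in *pfn.
 */
static bool mock_offset_pfn(struct mock_entry e, mock_pfn_check is_pfn,
                            unsigned long *pfn)
{
    if (!is_pfn(e))
        return false;
    *pfn = e.offset;
    return true;
}
```

With the old check, a hwpoison entry fails the sanity check even though its
offset does hold a PFN; with the fixed check the PFN is returned, while an
ordinary swap entry is still rejected.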