The patch titled
     Subject: mm,hwpoison: cleanup unused PageHuge() check
has been added to the -mm tree.  Its filename is
     mmhwpoison-cleanup-unused-pagehuge-check.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mmhwpoison-cleanup-unused-pagehuge-check.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mmhwpoison-cleanup-unused-pagehuge-check.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Subject: mm,hwpoison: cleanup unused PageHuge() check

Patch series "Hwpoison soft-offline rework", v4.

This patchset was initially based on Naoya's hwpoison rework [1], so
thanks to him for the initial work.  I would also like to thank Naoya for
testing the patchset off-line and reporting the issues he found; that was
quite helpful.

This patchset aims to fix some issues lying in soft-offline handling, but
it also takes the chance to perform some cleanups and refactoring.

 - Motivation:

   A customer and I were facing an issue where processes were killed
   after some of their pages had been soft-offlined.  This should not
   happen when soft-offlining, as it is meant to be non-disruptive.  I was
   able to reproduce the issue by stressing memory while soft-offlining
   pages at the same time.

   After debugging the issue, I saw that the problem was that pages were
   returned to user space after having been offlined properly.
   So, when those pages were faulted in, the fault handler returned
   VM_FAULT_POISON all the way down to the arch handler, and it simply
   killed the process.

   After further analysis, it became clear that the problem was that when
   kcompactd kicked in to migrate pages over, the compaction_alloc
   callback was handing poisoned pages to the migrate routine.

   All this could happen because isolate_freepages_block and
   fast_isolate_freepages just check for the page to be PageBuddy, and
   since 1) poisoned pages can be part of a higher order page and 2)
   poisoned pages are also PageBuddy, they can sneak in easily.

   I also saw some other problems with swap pages, but I suspected it to
   be the same sort of problem, so I did not follow that trace.

   The above refers to soft-offline.  But I also saw problems with
   hard-offline, especially hugetlb corruption, and some other weird
   stuff.  (I could paste the logs)

   The full explanation referring to the soft-offline case can be found
   at [2].

 - Approach:

   The approach taken is to contain those pages and never let them hit
   either pcplists or buddy freelists.  Only when they are completely out
   of reach do we flag them as poisoned.

   A full explanation of this can be found in patch#11 and patch#12.

 - Outcome:

   With this patchset, I no longer see the issues with soft-offline.

[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u

This patch (of 15):

Drop the PageHuge check since memory_failure forks into
memory_failure_hugetlb() for hugetlb pages.

Link: http://lkml.kernel.org/r/20200716123810.25292-1-osalvador@xxxxxxx
Link: http://lkml.kernel.org/r/20200716123810.25292-2-osalvador@xxxxxxx
Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Signed-off-by: Oscar Salvador <osalvador@xxxxxxxx>
Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
Cc: Dmitry Yakunin <zeil@xxxxxxxxxxxxxx>
Cc: Tony Luck <tony.luck@xxxxxxxxx>
Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory-failure.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/mm/memory-failure.c~mmhwpoison-cleanup-unused-pagehuge-check
+++ a/mm/memory-failure.c
@@ -1382,10 +1382,7 @@ int memory_failure(unsigned long pfn, in
 	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
 	 * correctly, we save a copy of the page flags at this time.
 	 */
-	if (PageHuge(p))
-		page_flags = hpage->flags;
-	else
-		page_flags = p->flags;
+	page_flags = p->flags;
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
_

Patches currently in -mm which might be from n-horiguchi@xxxxxxxxxxxxx are

mmhwpoison-cleanup-unused-pagehuge-check.patch
mmmadvise-call-soft_offline_page-without-mf_count_increased.patch
mmhwpoison-inject-dont-pin-for-hwpoison_filter.patch
mmhwpoison-remove-mf_count_increased.patch
mmhwpoison-remove-flag-argument-from-soft-offline-functions.patch
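
As an illustration of why the removed PageHuge() branch could never be
taken, here is a minimal userspace sketch in C.  It is a toy model and
not kernel code: struct toy_page, toy_memory_failure() and
toy_memory_failure_hugetlb() are hypothetical names, and they only mimic
the control flow the changelog describes, where memory_failure() forks
into memory_failure_hugetlb() for hugetlb pages before the page flags are
saved.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; NOT the kernel definition. */
struct toy_page {
	bool huge;		/* models the result of PageHuge(p) */
	unsigned long flags;	/* models p->flags */
};

/* Which path handled the page in this toy model. */
enum toy_path { TOY_PATH_HUGETLB, TOY_PATH_NORMAL };

/* Models memory_failure_hugetlb(): hugetlb pages are fully handled here. */
static enum toy_path toy_memory_failure_hugetlb(const struct toy_page *p)
{
	(void)p;
	return TOY_PATH_HUGETLB;
}

/*
 * Models the control flow of memory_failure(): hugetlb pages branch off
 * early, so the code that saves the page flags only sees non-huge pages.
 */
static enum toy_path toy_memory_failure(const struct toy_page *p,
					unsigned long *saved_flags)
{
	if (p->huge)
		return toy_memory_failure_hugetlb(p);

	/* A second PageHuge()-style test at this point can never be true. */
	assert(!p->huge);
	*saved_flags = p->flags;
	return TOY_PATH_NORMAL;
}
```

In this model, calling toy_memory_failure() on a page with huge set
returns through the hugetlb path and leaves *saved_flags untouched, while
a normal page has its flags copied.  That is the reason the sketch, like
the patch, needs only the plain assignment.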