+ mmhwpoison-cleanup-unused-pagehuge-check.patch added to -mm tree

The patch titled
     Subject: mm,hwpoison: cleanup unused PageHuge() check
has been added to the -mm tree.  Its filename is
     mmhwpoison-cleanup-unused-pagehuge-check.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mmhwpoison-cleanup-unused-pagehuge-check.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mmhwpoison-cleanup-unused-pagehuge-check.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Subject: mm,hwpoison: cleanup unused PageHuge() check

Patch series "Hwpoison soft-offline rework", v4.

This patchset was initially based on Naoya's hwpoison rework [1], so
thanks to him for the initial work.  I would also like to thank Naoya for
testing the patchset off-line and reporting the issues he found, which
was quite helpful.

This patchset aims to fix some issues lying in soft-offline handling, but
it also takes the chance to perform some cleanups and refactoring along
the way.


 - Motivation:

   A customer and I were facing an issue where processes were killed
   after having soft-offlined some of their pages.  This should not happen
   when soft-offlining, as it is meant to be non-disruptive.  I was able
   to reproduce the issue by stressing memory while soft-offlining pages
   at the same time.

   After debugging the issue, I saw that the problem was that pages
   were handed back to user-space after they had been properly offlined.
   So, when those pages were faulted in, the fault handler returned
   VM_FAULT_HWPOISON all the way down to the arch handler, and it simply
   killed the process.

   After further analysis, it became clear that the problem was that
   when kcompactd kicked in to migrate pages over, the compaction_alloc
   callback was handing poisoned pages to the migrate routine.

   All this could happen because isolate_freepages_block and
   fast_isolate_freepages only check that the page is PageBuddy, and
   since 1) poisoned pages can be part of a higher order page and 2)
   poisoned pages are also PageBuddy, they can sneak in easily.

   I also saw some other problems with swap pages, but I suspected them
   to be the same sort of problem, so I did not follow that trace.

   The above refers to soft-offline.  But I also saw problems with
   hard-offline, especially hugetlb corruption, and some other weird
   stuff.  (I could paste the logs)

   The full explanation referring to the soft-offline case can be found
   at [2].

 - Approach:

   The approach taken is to contain those pages and never let them hit
   either the pcplists or the buddy freelists.  Only when they are
   completely out of reach do we flag them as poisoned.

   A full explanation of this can be found in patch#11 and patch#12.

 - Outcome:

   With this patchset, I no longer see the issues with soft-offline.
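
To make the motivation above concrete, here is a minimal userspace sketch
of how a check on the buddy state alone can miss a poisoned page inside a
higher-order free block.  This is a toy model, not kernel code: struct
toy_page and all function names below are hypothetical stand-ins, and the
model assumes that only the head page of a free block carries the buddy
marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the two bits that matter here. */
struct toy_page {
	bool buddy;     /* head of a free block (PageBuddy-like marker) */
	bool hwpoison;  /* page was marked poisoned (PageHWPoison-like) */
};

/* Mimics an isolation check that only asks "is this block free?". */
static bool head_looks_free(const struct toy_page *pages, size_t head)
{
	return pages[head].buddy;
}

/* Scans every page of an order-`order` block for a poisoned member. */
static bool block_contains_poison(const struct toy_page *pages,
				  size_t head, unsigned int order)
{
	size_t nr;

	assert(order < 8 * sizeof(size_t));
	nr = (size_t)1 << order;
	for (size_t i = 0; i < nr; i++)
		if (pages[head + i].hwpoison)
			return true;
	return false;
}

/* Builds an order-2 free block (4 pages) with one poisoned tail page. */
static void build_poisoned_block(struct toy_page pages[4])
{
	for (size_t i = 0; i < 4; i++) {
		pages[i].buddy = (i == 0);
		pages[i].hwpoison = false;
	}
	pages[2].hwpoison = true;
}
```

In this model, head_looks_free() accepts the block built by
build_poisoned_block() even though block_contains_poison() reports a
poisoned member, which is the kind of gap described in the motivation.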

[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@xxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u


This patch (of 15):

Drop the PageHuge check, since memory_failure() branches off to
memory_failure_hugetlb() for hugetlb pages and therefore never reaches
this point with one.
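
The reasoning can be sketched with a small userspace model (an
illustration only, not the kernel implementation; struct toy_hpage,
enum toy_path and model_memory_failure() are hypothetical names).  Once
huge pages are dispatched to a separate handler up front, the code that
saves the flags can only ever see a non-huge page, so a second "is it
huge?" test at that point is dead code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page: a "huge" marker plus its flags word. */
struct toy_hpage {
	bool huge;
	unsigned long flags;
};

enum toy_path { TOY_GENERIC_PATH, TOY_HUGETLB_PATH };

/*
 * Models the control flow: huge pages leave through the early return,
 * so the flags copy below is reached for non-huge pages only.
 */
static enum toy_path model_memory_failure(const struct toy_hpage *p,
					  unsigned long *saved_flags)
{
	assert(p != NULL && saved_flags != NULL);

	if (p->huge)
		return TOY_HUGETLB_PATH;  /* handled by a separate routine */

	*saved_flags = p->flags;          /* no huge-page test needed here */
	return TOY_GENERIC_PATH;
}
```

Calling model_memory_failure() on a page with huge set returns
TOY_HUGETLB_PATH and leaves *saved_flags untouched.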

Link: http://lkml.kernel.org/r/20200716123810.25292-1-osalvador@xxxxxxx
Link: http://lkml.kernel.org/r/20200716123810.25292-2-osalvador@xxxxxxx
Signed-off-by: Naoya Horiguchi <n-horiguchi@xxxxxxxxxxxxx>
Signed-off-by: Oscar Salvador <osalvador@xxxxxxxx>
Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
Cc: Dmitry Yakunin <zeil@xxxxxxxxxxxxxx>
Cc: Tony Luck <tony.luck@xxxxxxxxx>
Cc: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/memory-failure.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/mm/memory-failure.c~mmhwpoison-cleanup-unused-pagehuge-check
+++ a/mm/memory-failure.c
@@ -1382,10 +1382,7 @@ int memory_failure(unsigned long pfn, in
 	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
 	 * correctly, we save a copy of the page flags at this time.
 	 */
-	if (PageHuge(p))
-		page_flags = hpage->flags;
-	else
-		page_flags = p->flags;
+	page_flags = p->flags;
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
_

Patches currently in -mm which might be from n-horiguchi@xxxxxxxxxxxxx are

mmhwpoison-cleanup-unused-pagehuge-check.patch
mmmadvise-call-soft_offline_page-without-mf_count_increased.patch
mmhwpoison-inject-dont-pin-for-hwpoison_filter.patch
mmhwpoison-remove-mf_count_increased.patch
mmhwpoison-remove-flag-argument-from-soft-offline-functions.patch
