Hi everyone,

I wrote hwpoison patches which partially address the problems discussed
recently in this area [1].

The main point of this series is how we isolate faulty pages more safely
and reliably. As pointed out by Michal in thread [2], we can have better
isolation functions than what we currently have. Patch 8/11 gives the
implementation. As a result, the behavior of poisoned pages (at least
from soft-offline) is more predictable, and I think that memory hotremove
should work properly with it.

The structure of this series:
  - patches 1-7 are small fixes, preparation, and/or cleanup. I can
    separate these out from the main part if you like.
  - patch 8 is the core part of this series, providing some code to pick
    out the target page from the buddy allocator,
  - patches 9-11 are changes on the caller side (hard-offline, hotremove
    and unpoison.)

One big issue not addressed by this series is hard-offlining hugetlb,
which is still a todo unfortunately.

Another remaining piece of work is to rework the behavior of the
PG_hwpoison flag when hard-offlining an in-use page. Even with this
series, hard-offline for in-use pages works as in the past (i.e. we still
take the racy "set PG_hwpoison at first, then do some handling" approach.)
Without changing this, we can't be free from many "if (PageHWPoison)"
checks in mm code. So I'll think/try more about it after this one.

Anyway this is the first step toward a better solution (I believe,) and
any kind of help is appreciated.

Thanks,
Naoya Horiguchi

[1]: https://lwn.net/Articles/753261/
[2]: https://lkml.org/lkml/2018/7/17/60
---
Summary:

Naoya Horiguchi (11):
      mm: hwpoison: cleanup unused PageHuge() check
      mm: soft-offline: add missing error check of set_hwpoison_free_buddy_page()
      mm: move definition of num_poisoned_pages_inc/dec to include/linux/mm.h
      mm: madvise: call soft_offline_page() without MF_COUNT_INCREASED
      mm: hwpoison-inject: don't pin for hwpoison_filter()
      mm: hwpoison: remove MF_COUNT_INCREASED
      mm: remove flag argument from soft offline functions
      mm: soft-offline: isolate error pages from buddy freelist
      mm: hwpoison: apply buddy page handling code to hard-offline
      mm: clear PageHWPoison in memory hotremove
      mm: hwpoison: introduce clear_hwpoison_free_buddy_page()

 drivers/base/memory.c      |   2 +-
 include/linux/mm.h         |  22 ++++++---
 include/linux/page-flags.h |   8 +++-
 include/linux/swapops.h    |  16 -------
 mm/hwpoison-inject.c       |  18 ++------
 mm/madvise.c               |  25 +++++-----
 mm/memory-failure.c        | 112 ++++++++++++++++++++++++++-------------------
 mm/migrate.c               |   9 ----
 mm/page_alloc.c            |  95 +++++++++++++++++++++++++++++++++++---
 mm/sparse.c                |   2 +-
 10 files changed, 193 insertions(+), 116 deletions(-)