On Fri, May 10, 2024 at 12:26:02AM -0600, Jane Chu wrote: > When handle hwpoison in a RDMA longterm pinned thp page, > try_to_split_thp_page() will fail. And at this point, there is > little else the kernel could do except sending a SIGBUS to > the user process, thus give it a chance to recover. Well, it does need to be a RDMA longterm pinned, right? Anything holding an extra refcount can already make us bite the dust, so I would not make it that specific. > Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx> > --- > mm/memory-failure.c | 31 ++++++++++++++++++++++++++----- > 1 file changed, 26 insertions(+), 5 deletions(-) > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 2fa884d8b5a3..15bb1c0c42e8 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1697,7 +1697,7 @@ static int identify_page_state(unsigned long pfn, struct page *p, > return page_action(ps, p, pfn); > } > > -static int try_to_split_thp_page(struct page *page) > +static int try_to_split_thp_page(struct page *page, bool release) > { > int ret; > > @@ -1705,7 +1705,7 @@ static int try_to_split_thp_page(struct page *page) > ret = split_huge_page(page); > unlock_page(page); > > - if (unlikely(ret)) > + if (ret && release) > put_page(page); I would document whhen and when not we can release the page. E.g: we cannot release it if there are still processes mapping the thp. > +static int kill_procs_now(struct page *p, unsigned long pfn, int flags, > + struct folio *folio) > +{ > + LIST_HEAD(tokill); > + > + collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED); > + kill_procs(&tokill, true, pfn, flags); > + > + return -EHWPOISON; You are returning -EHWPOISON here, > +} > + > /** > * memory_failure - Handle memory failure of a page. > * @pfn: Page Number of the corrupted page > @@ -2313,8 +2331,11 @@ int memory_failure(unsigned long pfn, int flags) > * page is a valid handlable page. > */ > folio_set_has_hwpoisoned(folio); > - if (try_to_split_thp_page(p) < 0) { > - res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED); > + if (try_to_split_thp_page(p, false) < 0) { > + pr_err("%#lx: thp split failed\n", pfn); > + res = kill_procs_now(p, pfn, flags, folio); > + put_page(p); > + res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED); just to overwrite it here with action_result(). Which one do we need? I think we would need -EBUSY here, right? So I would drop the retcode from kill_procs_now. Also, do we want the extra pr_err() here. action_result() will already provide us the pfn and the action_page_types which will be "unsplit thp". Is not that clear enough? I would drop that. -- Oscar Salvador SUSE Labs