On 2024/5/7 4:26, Jane Chu wrote: > On 5/5/2024 12:00 AM, Miaohe Lin wrote: > >> On 2024/5/2 7:24, Jane Chu wrote: >>> When handle hwpoison in a GUP longterm pin'ed thp page, >>> try_to_split_thp_page() will fail. And at this point, there is little else >>> the kernel could do except sending a SIGBUS to the user process, thus >>> give it a chance to recover. >>> >>> Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx> >> Thanks for your patch. Some comments below. >> >>> --- >>> mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 36 insertions(+) >>> >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>> index 7fcf182abb96..67f4d24a98e7 100644 >>> --- a/mm/memory-failure.c >>> +++ b/mm/memory-failure.c >>> @@ -2168,6 +2168,37 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, >>> return rc; >>> } >>> +/* >>> + * The calling condition is as such: thp split failed, page might have >>> + * been GUP longterm pinned, not much can be done for recovery. >>> + * But a SIGBUS should be delivered with vaddr provided so that the user >>> + * application has a chance to recover. Also, application processes' >>> + * election for MCE early killed will be honored. >>> + */ >>> +static int kill_procs_now(struct page *p, unsigned long pfn, int flags, >>> + struct page *hpage) >>> +{ >>> + struct folio *folio = page_folio(hpage); >>> + LIST_HEAD(tokill); >>> + int res = -EHWPOISON; >>> + >>> + /* deal with user pages only */ >>> + if (PageReserved(p) || PageSlab(p) || PageTable(p) || PageOffline(p)) >>> + res = -EBUSY; >>> + if (!(PageLRU(hpage) || PageHuge(p))) >>> + res = -EBUSY; >> Above checks seems unneeded. We already know it's thp? > > Agreed. > > I lifted these checks from hwpoison_user_mapping() with a hope to make kill_procs_now() more generic, > > such as, potentially replacing kill_accessing_processes() for re-accessing hwpoisoned page. > > But I backed out at last, due to concerns that my tests might not have covered sufficient number of scenarios. > >> >>> + >>> + if (res == -EHWPOISON) { >>> + collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED); >>> + kill_procs(&tokill, true, pfn, flags); >>> + } >>> + >>> + if (flags & MF_COUNT_INCREASED) >>> + put_page(p); >> This if block is broken. put_page() has been done when try_to_split_thp_page() fails? > > put_page() has not been done if try_to_split_thp_page() fails, and I think it should. In try_to_split_thp_page(), if split_huge_page fails, i.e. ret != 0, put_page() is called. See below: static int try_to_split_thp_page(struct page *page) { int ret; lock_page(page); ret = split_huge_page(page); unlock_page(page); if (unlikely(ret)) put_page(page); ^^^^^^^^^^^^^^^^^^^^^^^ return ret; } Or am I miss something? > > I will revise the code so that put_page() is called regardless MF_ACTION_REQUIRED is set or not. > >> >>> + >> action_result is missing? > > Indeed, action_result() isn't always called, referring to the re-accessing hwpoison scenarios. > > In this case, I think the reason is that, we just killed the process and there is nothing > > else to do or to report. > >> >>> + return res; >>> +} >>> + >>> /** >>> * memory_failure - Handle memory failure of a page. >>> * @pfn: Page Number of the corrupted page >>> @@ -2297,6 +2328,11 @@ int memory_failure(unsigned long pfn, int flags) >>> */ >>> SetPageHasHWPoisoned(hpage); >>> if (try_to_split_thp_page(p) < 0) { >> Should hwpoison_filter() be called in this case? > Yes, it should. I will add the hwpoison_filter check. >> >>> + if (flags & MF_ACTION_REQUIRED) { Only in MF_ACTION_REQUIRED case, SIGBUS is sent to processes when thp split failed. Any reson under it? Thanks. .