On Fri, May 10, 2024 at 12:26:00AM -0600, Jane Chu wrote: > Added two explicit MF_MSG messages describing failure in get_hwpoison_page. > Attemped to document the definition of various action names, and made a few > adjustment to the action_result() calls. > > Signed-off-by: Jane Chu <jane.chu@xxxxxxxxxx> ... > +/* > + * MF_IGNORED - Either the m-f() handler did nothing to recover, or it did "or if it " > + * something, then caught in a race condition which renders the effort sort "it was caught" I would also add to MF_IGNORED that we mark the page hwpoisoned anyway. > @@ -1018,7 +1034,7 @@ static int me_unknown(struct page_state *ps, struct page *p) > { > pr_err("%#lx: Unknown page state\n", page_to_pfn(p)); > unlock_page(p); > - return MF_FAILED; > + return MF_IGNORED; > } I was confused because I saw you replaced all MF_MSG_UNKNOWN, so I wondered how we can end up here until I saw this is a catch-all in case we fail to make sense of the page->flags. While you are improving this, I would suggest to add a little comment above the function explaining how we can reach it. > /* > @@ -2055,6 +2071,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb > if (flags & MF_ACTION_REQUIRED) { > folio = page_folio(p); > res = kill_accessing_process(current, folio_pfn(folio), flags); > + return action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED); I do not understand why are you doing this. First of all, why is this considered a failure? We did not fail this time, did we? We went right ahead and kill the process which was re-accessing the hwpoisoned page. Is that considered a failure? Second, you are know supressing -EHWPOISON with whatever action_result() will gives us, which judging from the actual code would be -EBUSY? I do not think that that is right, and we should be returning -EHWPOISON. Or whatever error code kill_accessing_process() gives us. > @@ -2231,6 +2248,7 @@ int memory_failure(unsigned long pfn, int flags) > res = kill_accessing_process(current, pfn, flags); > if (flags & MF_COUNT_INCREASED) > put_page(p); > + action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED); This is not coherent with what you did in try_memory_failure_hugetlb for MF_MSG_ALREADY_POISONED, I __think__ that in there we should be doing the same as we do here. -- Oscar Salvador SUSE Labs