On 2025/2/11 14:02, Shuai Xue wrote:
> When an uncorrected memory error is consumed there is a race between
> the CMCI from the memory controller reporting an uncorrected error
> with a UCNA signature, and the core reporting and SRAR signature
> machine check when the data is about to be consumed.
>
> If the CMCI wins that race, the page is marked poisoned when
> uc_decode_notifier() calls memory_failure(). For dirty pages,
> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
> converting the PTE to a hwpoison entry. However, for clean pages, the
> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
> marked as a hwpoison entry allowing kill_accessing_process() to:
>
> - call walk_page_range() and return 1
> - call kill_proc() to make sure a SIGBUS is sent
> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>   and kill_me_maybe() doesn't have to send it again.
>
> Conversely, for clean pages where PTE entries are not marked as hwpoison,
> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
> SIGBUS.
>
> Console log looks like this:
>
>     Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>     Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>     Memory failure: 0x827ca68: already hardware poisoned
>     mce: Memory error not recovered
>
> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
> an unnecessary SIGBUS.

Thanks for your patch.
>
> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
> ---
>  mm/memory-failure.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 995a15eb67e2..f9a6b136a6f0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>  			      (void *)&priv);
>  	if (ret == 1 && priv.tk.addr)
>  		kill_proc(&priv.tk, pfn, flags);
> -	else
> -		ret = 0;
>  	mmap_read_unlock(p->mm);
> -	return ret > 0 ? -EHWPOISON : -EFAULT;
> +
> +	return ret >= 0 ? -EHWPOISON : -EFAULT;

IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify
the caller that SIGBUS has already been sent to the process, so that
kill_me_maybe() doesn't have to send it again. But with your change,
kill_accessing_process() will return -EHWPOISON even if SIGBUS has not been
sent. Does this break the semantics of -EHWPOISON?

BTW, I scanned the code of walk_page_range(). It seems that, with the
hwpoison_walk_ops implementation, walk_page_range() will only return 0 or 1,
i.e. always >= 0. So kill_accessing_process() will always return -EHWPOISON
if this patch is applied.

Correct me if I'm missing something.

Thanks.
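
To make the concern concrete, here is a small user-space model of the two return mappings. It is only a sketch, not kernel code. The names model_before(), model_after(), struct model_result and the MODEL_* constants are hypothetical stand-ins introduced for illustration, walk_ret stands for the value returned by walk_page_range(), has_addr stands for priv.tk.addr being non-zero, and sigbus_sent records whether the kill_proc() branch would have been taken.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel errno values (assumption: the
 * asm-generic numbering, EFAULT = 14 and EHWPOISON = 133). */
#define MODEL_EFAULT    14
#define MODEL_EHWPOISON 133

struct model_result {
	int ret;          /* value handed back to the caller */
	bool sigbus_sent; /* true if the kill_proc() branch was taken */
};

/* Models the tail of kill_accessing_process() before the patch. */
struct model_result model_before(int walk_ret, bool has_addr)
{
	struct model_result r = { 0, false };
	int ret = walk_ret;

	if (ret == 1 && has_addr)
		r.sigbus_sent = true;	/* stands in for kill_proc() */
	else
		ret = 0;

	r.ret = ret > 0 ? -MODEL_EHWPOISON : -MODEL_EFAULT;
	return r;
}

/* Models the same tail with the patch applied. */
struct model_result model_after(int walk_ret, bool has_addr)
{
	struct model_result r = { 0, false };

	if (walk_ret == 1 && has_addr)
		r.sigbus_sent = true;	/* stands in for kill_proc() */

	r.ret = walk_ret >= 0 ? -MODEL_EHWPOISON : -MODEL_EFAULT;
	return r;
}
```

In this model, a walk result of 0 (no hwpoison entry found) yields -MODEL_EFAULT before the patch and -MODEL_EHWPOISON after it, and sigbus_sent stays false in both cases. That is the case where the caller would be told that SIGBUS was delivered although it was not. If the walk can only return 0 or 1, the -MODEL_EFAULT arm of model_after() is never reached.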