On Fri, May 21, 2021 at 12:01:55PM +0900, Naoya Horiguchi wrote:
> From: Aili Yao <yaoaili@xxxxxxxxxxxx>
>
> When memory_failure() is called with MF_ACTION_REQUIRED on the
> page that has already been hwpoisoned, memory_failure() could fail
> to send SIGBUS to the affected process, which results in infinite
> loop of MCEs.
>
> Currently memory_failure() returns 0 if it's called for already
> hwpoisoned page, then the caller, kill_me_maybe(), could return
> without sending SIGBUS to current process.  An action required MCE
> is raised when the current process accesses to the broken memory,
> so no SIGBUS means that the current process continues to run and
> access to the error page again soon, so running into MCE loop.
>
> This issue can arise for example in the following scenarios:
>
>   - Two or more threads access to the poisoned page concurrently.
>     If local MCE is enabled, MCE handler independently handles the
>     MCE events.  So there's a race among MCE events, and the
>     second or latter threads fall into the situation in question.
>
>   - If there was a precedent memory error event and memory_failure()
>     for the event failed to unmap the error page for some reason,
>     the subsequent memory access to the error page triggers the
>     MCE loop situation.
>
> To fix the issue, make memory_failure() return an error code when the
> error page has already been hwpoisoned.  This allows memory error
> handler to control how it sends signals to userspace.  And make sure
> that any process touching a hwpoisoned page should get a SIGBUS even
> in "already hwpoisoned" path of memory_failure() as is done in page
> fault path.
>
> Signed-off-by: Aili Yao <yaoaili@xxxxxxxxxxxx>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>

Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>

--
Oscar Salvador
SUSE L3
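
For anyone reading the archive later, the return-value change described in the changelog can be modelled in userspace roughly as below. This is only a toy sketch, not the kernel code: toy_memory_failure(), toy_kill_me_maybe(), TOY_EALREADY_POISONED, NPAGES and the sigbus_count counter are all made-up names for illustration, and the assumption that first-time handling delivers the signal itself is a simplification. The point it tries to show is that once the "already hwpoisoned" path returns an error instead of 0, the caller can tell the two cases apart and still signal the accessing task.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8                  /* hypothetical size of the toy page array */
#define TOY_EALREADY_POISONED 1   /* hypothetical error code, not a kernel value */

static bool hwpoisoned[NPAGES];   /* toy stand-in for the per-page poison flag */
static int sigbus_count;          /* number of SIGBUS deliveries we model */

/*
 * Toy model of the fixed behaviour: return 0 when the page is handled
 * for the first time, and a negative error code when it was already
 * marked poisoned by an earlier event.
 */
static int toy_memory_failure(unsigned long pfn)
{
    assert(pfn < NPAGES);

    if (hwpoisoned[pfn])
        return -TOY_EALREADY_POISONED;

    hwpoisoned[pfn] = true;
    /* Simplifying assumption: first-time handling signals the task itself. */
    sigbus_count++;
    return 0;
}

/*
 * Toy model of the caller: a zero return means nothing more to do,
 * any error (including "already poisoned") leads to an explicit signal,
 * so the accessing task never continues to run unsignalled.
 */
static void toy_kill_me_maybe(unsigned long pfn)
{
    int ret = toy_memory_failure(pfn);

    if (ret == 0)
        return;

    sigbus_count++;
}
```

With this model, calling toy_kill_me_maybe() twice on the same page number bumps sigbus_count on both calls. If the second call to toy_memory_failure() returned 0 instead, the counter would stay unchanged, which corresponds to the MCE loop in the changelog.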