Re: [PATCH] mm,hwpoison: non-current task should be checked early_kill for force_early

HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@xxxxxxx> · Mon, 18 Jan 2021 08:57:47 +0000

On Mon, Jan 18, 2021 at 04:15:12PM +0800, Aili Yao wrote:
> On Mon, 18 Jan 2021 06:50:54 +0000
> HORIGUCHI NAOYA(堀口　直也) <naoya.horiguchi@xxxxxxx> wrote:
> 
> > 
> > For action optional cases, one error event kills *only one* process. If an
> > error page is shared by multiple processes, these processes will be killed
> > by separate error events, each of which is triggered when each process tries
> > to access the error memory.  So these processes would be killed immediately
> > when accessing the error, but you don't have to kill all at the same time
> > (or actually you might not even have to kill a process at all if it
> > eventually exits without accessing the error).
> > 
> > Maybe the function variable "force_early" is named confusingly (it sounds
> > like it's related to the PF_MCE_KILL_EARLY flag, but that's incorrect).
> > I'll submit a fix later.  (I'll add your "Reported-by" because you made me
> > find it, thank you.)
> > 
> I think we should do more for non current process error case: we should mark it AO
> for processes to be signaled, or we may take the wrong action.

I'm not sure what you mean by "non current process error case" and "we
should mark it AO", so could you explain your error scenario more
specifically?  In particular, I'd like to know who triggers hard offline,
on what hardware event, and what "wrong action" could happen.  Just saying
"calling memory_failure() with MF_ACTION_REQUIRED" is probably not enough,
because it doesn't show us that your scenario is actually possible. The
current implementation implicitly assumes some hardware behavior, and does
not handle cases that never happen under that assumption.

Do you have any test cases that reproduce a specific issue (like data loss)
on your system? (If yes, please share them.) Or does your concern come from
code review?
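
For what it's worth, a userspace reproducer would probably start by opting
the test process into the early-kill policy. Below is a minimal sketch,
assuming a Linux system that provides the prctl(2) PR_MCE_KILL interface.
The helper names are made up for illustration and are not taken from any
existing test.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: ask the kernel to use the early-kill policy for
 * the calling thread. Returns 0 on success, -1 on error (errno is set).
 * The unused trailing prctl() arguments must be zero.
 */
int set_mce_kill_early(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/*
 * Hypothetical helper: read back the calling thread's current policy.
 * Returns PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE or PR_MCE_KILL_DEFAULT,
 * or -1 on error.
 */
int get_mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0, 0, 0, 0);
}
```

This only covers the policy setup. A real test would still have to inject
an error into a page shared with another process (for example with
madvise(MADV_HWPOISON), which needs CAP_SYS_ADMIN and
CONFIG_MEMORY_FAILURE), and that part is left out here.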

Thanks,
Naoya Horiguchi