Re: [PATCH] mm,hwpoison: non-current task should be checked early_kill for force_early

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jan 18, 2021 at 01:57:44PM +0800, Aili Yao wrote:
> On Mon, 18 Jan 2021 05:15:55 +0000
> HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi@xxxxxxx> wrote:
> 
> > Hi Aili,
> > 
> > On Fri, Jan 15, 2021 at 05:26:22PM +0800, Aili Yao wrote:
> > > On Fri, 15 Jan 2021 09:49:24 +0100
> > > Oscar Salvador <osalvador@xxxxxxx> wrote:
> > >   
> > > > I am having a hard time trying to grasp what are you trying to achieve here.
> > > > Could you elaborate some more? Ideally stating what is the problem you are
> > > > fixing here.
> > > >   
> > > Sorry for confusion, example: there are four process A,B,C,D,which map the same file into
> > > there process space, which set there PF_MCE_KILL_EARLY flag to TRUE, if process A trigger one
> > > UE with  MF_ACTION_REQUIRED set, in current code, only process A will be killed, B,C,D remain
> > > alive, but for the PF_MCE_KILL_EARLY we set, we want B,C,D also be killed.  
> > 
> > This behavior seems not to me what PF_MCE_KILL_EARLY intends.  This flag
> > controls whether memory error handler kills processes immediately or not,
> > and it only affects action optional cases (i.e. called without
> > MF_ACTION_REQUIRED).  In MF_ACTION_REQUIRED case, we have no such choice
> > and affected processes should be always killed immediately.
> > 
> > We may also need to consider the difference in context of these two cases.
> > Action optional case is called asynchronously by background process like
> > memory scrubbing, so all processes mapping the error memory are the affected
> > ones.  Action required event is more synchronous, and is called when a
> > process experiences memory access errors on data load and instruction fetch
> > instructions.  So the affected process in this case is only the process.
> > So I still think the this background justifies the current behavior.
> > 
> > But my knowledge might be old, if you have newer hardwares which define
> > other type of memory error and that doesn't fit with current implementation,
> > I'd like to extend code to support the new cases, so please let me know.
> > 
> Sorry, I don't fully get your concern.
> 
> For Action optional cases, It's may from CE storm or patrol scrub, ...

hwpoison is not about corrected errors, but about uncorrected errors. CE storm
should be handled by CMCI and userspace tool like mcelog, although it seems not
current main topic, sorry for nitpick.

> when the process want to process this condition,
> it will set PF_MCE_KILL_EARLY, and it will be signaled for such case.
> For Action Required cases,we must do something, I think it's more urgent and serious, In the current code, the process triggered the Error
> Should be signaled. but the process with PF_MCE_KILL_EARLY won't get signaled, just because PF_MCE_KILL_EARLY is for action optional case?

I don't use PF_MCE_KILL_EARLY to justify current code. Let me explain more.

For action optional cases, one error event kills *only one* process. If an
error page are shared by multiple processes, these processes will be killed
by separate error events, each of which is triggered when each process tries
to access the error memory.  So these processes would be killed immediately
when accessing the error, but you don't have to kill all at the same time
(or actually you might not even have to kill it at all if the process exits
finally without accessing the error later).

Maybe the function variable "force_early" is named confusingly (it sounds
that it's related to PF_MCE_KILL_EARLY flag, but that's incorrect).
I'll submit a fix later.  (I'll add your "Reported-by" because you made me
find it, thank you.)

> 
> Action Required is for current we must handle, the same Action Required issue is Action optional for non-current processes, Right?

Right.

> I don't think Action Required is for all processes, For current processes , it may be AR, for other process, it may be AO, and they should also
> be signaled, I think this behavior its reasonable. 
> 
> And we can't determine which error will be triggered, the PF_MCE_KILL_EARLY fLAG is meant to handle memory error gracefully and won't be restricted
> to explicitly declared AO errors.
> 
> Thanks!

Thank you, too.

- Naoya




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux