Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> · Sun, 23 Oct 2022 23:04:22 +0800



在 2022/10/22 AM12:30, Luck, Tony 写道:
>>> But maybe it is some RMW instruction ... then, if all the above options didn't happen ... we
>>> could get another machine check from the same address. But then we just follow the usual
>>> recovery path.
> 
> 
>> Let assume the instruction that cause the COW is in the 63/64 case, aka,
>> it is writing a different cache line from the poisoned one. But the new_page
>> allocated in COW is dropped right? So might page fault again?
> 
> It can, but this should be no surprise to a user that has a signal handler for
> a h/w event (SIGBUS, SIGSEGV, SIGILL) that does nothing to address the
> problem, but simply returns to re-execute the same instruction that caused
> the original trap.
> 
> There may be badly written signal handlers that do this. But they just cause
> pain for themselves. Linux can keep taking the traps and fixing things up and
> sending a new signal over and over.
> 
> In this case that loop may involve taking the machine check again, so some
> extra pain for the kernel, but recoverable machine checks on Intel/x86 switched
> from broadcast to delivery to just the logical CPU that tried to consume the poison
> a few generations back. So only a bit more painful than a repeated page fault.
> 
> -Tony
> 
> 

I see, thanks for your patient explanation :)

Best Regards,
Shuai