On 2022/5/5 21:41, Catalin Marinas wrote:
On Thu, May 05, 2022 at 02:39:43PM +0800, Tong Tiangen wrote:
On 2022/5/4 18:26, Catalin Marinas wrote:
On Wed, Apr 20, 2022 at 03:04:15AM +0000, Tong Tiangen wrote:
Make copy_{to, from}_user() machine check safe.
If a copy fails due to a hardware memory error, only the relevant processes are
affected, so killing the user process and isolating the user page with the
hardware memory error is a more reasonable choice than a kernel panic.
Just to make sure I understand - we can only recover if the fault is in
a user page. That is, for a copy_from_user(), we can only handle the
faults in the source address, not the destination.
At first, I also thought we could only recover if the fault is in a
user page.
After a discussion with Mark[1], I think that whether the page is a user page or a
kernel page, as long as the access is triggered by the user process, only the
related processes will be affected. Under this understanding, it seems that
all uaccess faults can be recovered.
[1]https://patchwork.kernel.org/project/linux-arm-kernel/patch/20220406091311.3354723-6-tongtiangen@xxxxxxxxxx/
We can indeed safely skip this copy and return an error, just as if
there had been a user page fault. However, my point was more
about the "isolate the user page with hardware memory errors" part. If the
fault is on a kernel address, there's not much you can do about it. You'll
likely trigger it later when you try to access that address (maybe it
was freed and re-allocated). Are we hoping we won't get the same error
again on that kernel address?
I think the page with the memory error will be isolated by memory_failure().
Generally, isolation will succeed; if isolation fails (we would need to find
out why), then the same error may be triggered again later.
Thanks.