RE: [PATCH] x86/mce: set MCE_IN_KERNEL_COPYIN for all MC-Safe Copy

"Luck, Tony" <tony.luck@xxxxxxxxx> · Mon, 22 May 2023 18:02:29 +0000

>> Is this patch in addition to, or instead of, the earlier core dump patch?
>
> This is an addition, in previous coredump patch, manually call
> memory_failure_queue() to be asked to cope with corrupted page, and it
> is similar to your "Copy-on-write poison recovery"[1], but after some
> discussion, I think we could add MCE_IN_KERNEL_COPYIN to all MC-safe
> copy, which will cope with corrupted page in the core
> do_machine_check() instead of do it one-by-one.

Thanks for the context. I see how this all fits together now.

Your patch looks good.

Reviewed-by: Tony Luck <tony.luck@xxxxxxxxx>

-Tony

One small observation from testing. I injected an error into an application,
which consumed the poisoned data and was sent a SIGBUS.

Kernel did not crash (hurrah!)

Console log said:

[  417.610930] mce: [Hardware Error]: Machine check events logged
[  417.618372] Memory failure: 0x89167f: recovery action for dirty LRU page: Recovered
... EDAC messages
[  423.666918] MCE: Killing testprog:4770 due to hardware memory corruption fault at 7f8eccf35000

A core file was generated and saved in /var/lib/systemd/coredump

But my shell (/bin/bash) only said:

Bus error

not

Bus error (core dumped)

-Tony