On 2023/5/23 2:02, Luck, Tony wrote:
Is this patch in addition to, or instead of, the earlier core dump patch?
This is in addition to it. The previous coredump patch manually calls
memory_failure_queue() to deal with the corrupted page, which is similar
to your "Copy-on-write poison recovery"[1]. After some discussion, I think
we could add MCE_IN_KERNEL_COPYIN to all MC-safe copies, so that the
corrupted page is handled in the core do_machine_check() path instead of
handling it one by one at each call site.
Thanks for the context. I see how this all fits together now.
Your patch looks good.
Reviewed-by: Tony Luck <tony.luck@xxxxxxxxx>
Thanks for confirming.
-Tony
One small observation from testing. I injected an error into an application,
which consumed the poisoned data and was sent a SIGBUS.
Kernel did not crash (hurrah!)
Yes, no crash is always great.
Console log said:
[ 417.610930] mce: [Hardware Error]: Machine check events logged
[ 417.618372] Memory failure: 0x89167f: recovery action for dirty LRU page: Recovered
... EDAC messages
[ 423.666918] MCE: Killing testprog:4770 due to hardware memory corruption fault at 7f8eccf35000
A core file was generated and saved in /var/lib/systemd/coredump
But my shell (/bin/bash) only said:
Bus error
not
Bus error (core dumped)
Not sure about the effect, but since there are kernel messages and mcelog
records, the difference does not seem to be a big deal :)
-Tony
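
For reference, the two shell messages above differ only in the "core dumped"
flag of the wait status that the parent receives. The sketch below shows how a
parent process can tell the two cases apart. It is a minimal illustration and
not bash's actual code; describe_wait_status() is a hypothetical helper name,
and the synthetic 0x80 bit in the usage lines assumes the Linux wait status
layout. It does not explain why the flag was not set in the test above.

```python
import os
import signal


def describe_wait_status(status: int) -> str:
    """Format a wait status roughly the way a shell reports a child's death.

    Hypothetical helper for illustration; not taken from bash or the kernel.
    """
    if os.WIFEXITED(status):
        return f"exited with status {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        signum = os.WTERMSIG(status)
        # strsignal() may return None for unknown signals, so fall back.
        text = signal.strsignal(signum) or f"signal {signum}"
        # The suffix depends only on the core-dump bit in the wait status,
        # not on whether a core file actually exists somewhere on disk.
        if os.WCOREDUMP(status):
            text += " (core dumped)"
        return text
    return f"unrecognized wait status {status:#x}"


# Synthetic statuses (assumption: Linux layout, core-dump bit is 0x80).
print(describe_wait_status(int(signal.SIGBUS)))
print(describe_wait_status(int(signal.SIGBUS) | 0x80))
```

In real use the status would come from os.waitpid() on the child rather than
from a hand-built value.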