On 2025/2/25 06:01, Borislav Petkov wrote:
On Fri, Feb 21, 2025 at 02:05:28PM +0800, Shuai Xue wrote:
#perf script
kworker/48:1-mm 25516 [048] 1713.893549: probe:memory_failure: (ffffffffaa622db4)
ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])
einj_mem_uc 44530 [184] 1713.908089: probe:memory_failure: (ffffffffaa622db4)
ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
einj_mem_uc 44531 [089] 1713.916319: probe:memory_failure: (ffffffffaa622db4)
ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
What are those stack traces supposed to say?
Two processes are injecting, cause a #MC and a kworker gets to handle the UC?
All injecting to the same page?
Yes, I inject poison into a page and create two threads with pthread_create(),
both of which access the same poisoned page.
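For readers who want to picture the test setup, here is a minimal userspace sketch of "two threads reading the same page". It is not the einj_mem_uc source: the names reader_arg, reader and two_readers_same_page are hypothetical, and no error is actually injected (the real test plants the poison via EINJ before the reads happen).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical per-thread argument: both threads point at the same page. */
struct reader_arg {
	volatile unsigned char *page;	/* shared page both threads read */
	unsigned char seen;		/* first byte observed by this thread */
};

/* Thread body: a single load from the shared page. With a real injected
 * error, this load is what would raise the #MC on the reading CPU. */
static void *reader(void *p)
{
	struct reader_arg *arg = p;

	arg->seen = arg->page[0];
	return NULL;
}

/* Map one anonymous page, dirty it, then let two threads read it.
 * Returns 0 if both threads saw the fill byte, -1 on any failure. */
int two_readers_same_page(unsigned char fill)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned char *page;
	pthread_t tid[2];
	int started = 0, rc = 0;

	if (psz <= 0)
		return -1;

	page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;

	page[0] = fill;	/* the write makes this a dirty page */

	/* Assumption: a real test would inject the error at this point. */
	struct reader_arg args[2] = { { page, 0 }, { page, 0 } };

	for (int i = 0; i < 2; i++) {
		if (pthread_create(&tid[i], NULL, reader, &args[i]) != 0) {
			rc = -1;
			break;
		}
		started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	if (rc == 0 && (args[0].seen != fill || args[1].seen != fill))
		rc = -1;

	munmap(page, (size_t)psz);
	return rc;
}
```

Without injection both reads simply succeed; with injection each reading thread would take its own #MC, which matches the two kill_me_maybe() traces above.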
What's the upper limit on CPUs seeing the same hw error and all raising
a CMCI/#MC?
It depends on how many of the forked processes try to read the poisoned page.
- kill_accessing_process() is only called when the flags are set to
MF_ACTION_REQUIRED, which means it is in the MCE path.
- Whether the page is clean determines the behavior of try_to_unmap. For a
dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert it
to a hwpoison swap entry. For a clean page, try_to_unmap is called without
TTU_HWPOISON and simply unmaps the PTE.
- When does walk_page_range() with hwpoison_walk_ops return 1?
1. If the poison page still exists, we should of course kill the current
process.
2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
it is a dirty page, we should also kill the current process.
3. Otherwise, it returns 0, which means the page is clean.
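The three cases above, together with the MF_ACTION_REQUIRED gating, can be modelled in a few lines of plain C. This is only a sketch of the decision as described in this mail, not the kernel code: enum pte_state, model_hwpoison_walk() and model_should_kill() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of what the page walk finds for the
 * faulting address. These are not kernel types. */
enum pte_state {
	PTE_MAPS_POISONED_PAGE,	/* case 1: PTE still maps the poisoned page */
	PTE_HWPOISON_ENTRY,	/* case 2: dirty page, PTE is a hwpoison entry */
	PTE_NONE_OR_OTHER,	/* case 3: clean page, PTE was simply unmapped */
};

/* Models the return value of the walk: 1 means "kill the current
 * process", 0 means the page was clean and nothing needs to be killed. */
int model_hwpoison_walk(enum pte_state state)
{
	switch (state) {
	case PTE_MAPS_POISONED_PAGE:
	case PTE_HWPOISON_ENTRY:
		return 1;
	case PTE_NONE_OR_OTHER:
	default:
		return 0;
	}
}

/* The kill path is only entered when the flags carry MF_ACTION_REQUIRED,
 * i.e. when we come from the MCE path. */
bool model_should_kill(bool action_required, enum pte_state state)
{
	if (!action_required)
		return false;
	return model_hwpoison_walk(state) == 1;
}

/* Usage example: one assertion per case listed in the mail. */
void model_selftest(void)
{
	assert(model_should_kill(true, PTE_MAPS_POISONED_PAGE));
	assert(model_should_kill(true, PTE_HWPOISON_ENTRY));
	assert(!model_should_kill(true, PTE_NONE_OR_OTHER));
	assert(!model_should_kill(false, PTE_HWPOISON_ENTRY));
}
```

The model deliberately ignores everything else memory_failure does (page isolation, huge pages, races between the kworker and the task work); it only captures the kill/no-kill outcome for the current process.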
I think you're too deep into detail. What I'd do is step back, think what
would be the *proper* recovery action and then make sure memory_failure does
that. If it doesn't - fix it to do so.
So, what should really happen wrt recovery action if any number of CPUs see
the same memory error?
IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
detect a memory error on a dirty page, which is the current behavior of
memory_failure.
Thanks
Shuai