Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling

On 2025/2/28 20:35, Borislav Petkov wrote:
> On Tue, Feb 25, 2025 at 09:51:25AM +0800, Shuai Xue wrote:
>> It depends on the forked process which is trying to read the poison.
>
> And? Can you try creating more processes and see what happens then?


Sure.

The experimental model includes:

1. inject UE to a memory buffer
2. create 10 processes
3. all 10 processes read the poisoned buffer
4. 10 MCEs and 1 UCNA will be triggered
5. each process receives a SIGBUS
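
Step 5 can be observed from user space with a SIGBUS handler that inspects siginfo. The sketch below is not the einj_mem_uc source; it is a minimal illustration, assuming Linux with glibc, of how a process might record the code and address of a machine-check SIGBUS and recover instead of re-executing the faulting load. The names bus_code_name(), install_sigbus_handler() and read_byte_guarded() are hypothetical. On x86 Linux, signal 7 is SIGBUS and si_code 4 is BUS_MCEERR_AR (action required), which are the values printed in the log that follows.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf recover_point;              /* where the handler jumps back to */
static volatile sig_atomic_t last_bus_code;   /* si_code of the last SIGBUS */
static void *volatile last_bus_addr;          /* si_addr of the last SIGBUS */

/* Map a SIGBUS si_code to a readable name (hypothetical helper). */
const char *bus_code_name(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR: return "BUS_MCEERR_AR"; /* poison consumed, action required */
    case BUS_MCEERR_AO: return "BUS_MCEERR_AO"; /* poison reported, action optional */
    default:            return "other";
    }
}

/* Record what the kernel reported, then unwind to the guarded read. */
static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    last_bus_code = info->si_code;
    last_bus_addr = info->si_addr;
    siglongjmp(recover_point, 1);
}

/* Install the handler with SA_SIGINFO so si_code and si_addr are available. */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Read one byte; return 0 on success, 1 if the read raised SIGBUS. */
int read_byte_guarded(const volatile unsigned char *p)
{
    if (sigsetjmp(recover_point, 1) != 0)
        return 1;
    (void)*p;
    return 0;
}
```

A caller would run install_sigbus_handler() once, then call read_byte_guarded() on the buffer; when it returns 1, last_bus_code and last_bus_addr hold the values to print, and bus_code_name() turns the code into text.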

Some details:

#perf record -e probe:memory_failure -agR -- ./einj_mem_uc thread
0: thread   vaddr = 0x7f65f08da400 paddr = 82702ec400
injecting ...
trigger_thread
trigger_thread
trigger_thread
trigger_thread
trigger_thread
trigger_thread
trigger_thread
trigger_thread
trigger_thread
trigger_thread
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
Test passed
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.640 MB perf.data (11 samples) ]


#perf script
einj_mem_uc 1722254 [151] 695128.161644: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722255 [014] 695128.161712: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722256 [153] 695128.161716: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722257 [124] 695128.161759: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722258 [154] 695128.161782: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722259 [026] 695128.161819: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722260 [157] 695128.161852: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722261 [158] 695128.161895: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

kworker/50:3-mm 1714430 [050] 695128.168736: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
        ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
        ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
        ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
        ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
        ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
        ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
        ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
        ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 1722252 [050] 695128.183025: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722253 [051] 695128.191348: probe:memory_failure: (ffffffffaa622db4)
        ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
        ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
        ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
        ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
        ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
        ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                  405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

>> IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
>> detect a memory error for a dirty page, which is the current behavior in
>> memory_failure().
>
> And for all those other processes which do get to see the already
> poisoned/clean page, they should continue on their merry way instead of
> getting killed by a SIGBUS?


Yes, memory_failure() only sends a SIGBUS signal to the process that
is actively reading a poisoned page. Other processes that share the
poisoned page will not receive a SIGBUS signal unless they have the
PF_MCE_EARLY flag set.[1]

[1]https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@xxxxxxxxxx
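
For readers unfamiliar with PF_MCE_EARLY: user space normally sets that flag through prctl(PR_MCE_KILL), as described in prctl(2). The following is a minimal sketch, assuming Linux with glibc, of how a thread might opt in to early kill and read the policy back. The helper names mce_request_early_kill(), mce_kill_policy() and mce_kill_policy_name() are hypothetical, and the sketch is not taken from the patch series.

```c
#include <sys/prctl.h>

/*
 * Ask the kernel to apply the early-kill policy to the calling thread,
 * which sets PF_MCE_PROCESS and PF_MCE_EARLY on the task.
 * Returns 0 on success, -1 with errno set on failure.
 */
int mce_request_early_kill(void)
{
    return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
                 (unsigned long)PR_MCE_KILL_EARLY, 0UL, 0UL);
}

/*
 * Query the current policy. Returns PR_MCE_KILL_EARLY, PR_MCE_KILL_LATE
 * or PR_MCE_KILL_DEFAULT, or -1 with errno set on failure.
 */
int mce_kill_policy(void)
{
    return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/* Human-readable name for a policy value (hypothetical helper). */
const char *mce_kill_policy_name(int policy)
{
    switch (policy) {
    case PR_MCE_KILL_EARLY:   return "early";
    case PR_MCE_KILL_LATE:    return "late";
    case PR_MCE_KILL_DEFAULT: return "default";
    default:                  return "unknown";
    }
}
```

A process that shares the buffer but does not touch the poison would call mce_request_early_kill() at startup if it wants to be signalled as soon as the page is marked poisoned; without it, the policy reported by mce_kill_policy() stays at the system default.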

Thanks.
Shuai





