On 2022/9/28 at 1:47 AM, Luck, Tony wrote:
> I follow and agree with everything up until:
>
>> In a conclusion, the error will be handled in a kworker with or without this fix.
>
> It isn't handled during the interrupt (it can't be).

Yes, it is not handled during the interrupt, and it does not have to be.

> Who handles the error if the interrupt happens during the execution of a kthread?

As I mentioned, the GHES driver always queues work into a workqueue to handle
the memory failure of a page in memory_failure_queue(), so the **worker will be
scheduled and handle the memory failure later**.

> Can't use the task_work_add() trick to handle it (because this thread never returns to user mode).

Yes, it cannot. And this is the key point to fix.

> So how is the error handled?

The workflow to handle a hardware error is summarized below:

-----------------------------------------------------------------------------
[ghes_sdei_critical_callback: current swapper/3, CPU 3]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry    // peek and read estatus
        => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------------------------------------------------
[ghes_proc_in_irq: current swapper/3, CPU 3]
    => ghes_do_proc
        => ghes_handle_memory_failure
            => ghes_do_memory_failure
                => memory_failure_queue    // put work task on current CPU
                    => if (kfifo_put(&mf_cpu->fifo, entry))
                           schedule_work_on(smp_processor_id(), &mf_cpu->work);
    => task_work_add(current, &estatus_node->task_work, TWA_RESUME); // fix here, always added to current
[ghes_proc_in_irq: return]
-----------------------------------------------------------------------------
// kworker preempts swapper/3 on CPU 3 due to RESCHED flag
[memory_failure_work_func: current kworker, CPU 3]
    => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry);    // until get no work
            => soft/hard offline
-----------------------------------------------------------------------------

STEP 0: The firmware notifies the kernel of the hardware error through SDEI
(ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED).

STEP 1: In the SDEI callback (or any NMI-like handler), memory from
ghes_estatus_pool is used to save the estatus, which is then added to
ghes_estatus_llist. The swapper running on CPU 3 is interrupted.
irq_work_queue() causes ghes_proc_in_irq() to run in IRQ context, where each
estatus in ghes_estatus_llist is processed.

STEP 2: In IRQ context, ghes_proc_in_irq() queues memory failure work on the
current CPU in the workqueue and adds task work to sync with the workqueue.

STEP 3: The kworker preempts the currently running thread and gets CPU 3. The
memory failure is then processed in the kworker.

(STEP 4 for a user thread: ghes_kick_task_work() is called as task_work to
ensure that any queued work has been done before returning to user space. The
estatus_node is freed.)

If the task work is not added, estatus_node->task_work.func will be NULL, and
estatus_node is freed in STEP 2.

Hope this helps to make the problem clearer. You can also check the stacks
dumped from the key functions in the flow above.
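To make the estatus_node lifetime rule in STEP 2 and STEP 4 concrete, here is a
minimal single-threaded userspace sketch. It is not kernel code: every model_*
name is hypothetical, the fifo is reduced to a counter, and "freed" is only a
flag so the outcome can be observed. It only assumes what is described above:
the node is freed in STEP 2 when no task work is pending, and otherwise freed
by the task work after the queued work has been flushed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ghes_estatus_node. */
struct model_node {
	/* NULL means no task work is pending for this node. */
	void (*task_work_func)(struct model_node *node);
	/* Set instead of really freeing, so callers can observe it. */
	bool freed;
};

static int mf_fifo_depth;	/* models mf_cpu->fifo on one CPU */
static int pages_handled;	/* entries drained by the modelled kworker */

/* Models memory_failure_queue(): enqueue one entry for the kworker. */
static void model_memory_failure_queue(void)
{
	mf_fifo_depth++;
}

/* STEP 3: models memory_failure_work_func(): drain until no work is left. */
static void model_kworker_run(void)
{
	while (mf_fifo_depth > 0) {
		mf_fifo_depth--;
		pages_handled++;
	}
}

/* STEP 4: models ghes_kick_task_work(): flush queued work, then free. */
static void model_kick_task_work(struct model_node *node)
{
	model_kworker_run();
	node->freed = true;
}

/* STEP 2: models the tail of ghes_proc_in_irq() for a single node. */
static void model_proc_in_irq(struct model_node *node, bool add_task_work)
{
	assert(node != NULL);
	model_memory_failure_queue();
	if (add_task_work)
		node->task_work_func = model_kick_task_work;
	/* No task work pending: the node is released right here. */
	if (!node->task_work_func)
		node->freed = true;
}

/* Models the return to user mode: run the pending task work, if any. */
static void model_return_to_user(struct model_node *node)
{
	void (*func)(struct model_node *) = node->task_work_func;

	if (func) {
		node->task_work_func = NULL;
		func(node);
	}
}
```

Calling model_proc_in_irq(&node, false) marks the node freed immediately, and
the page is handled only once model_kworker_run() runs. With
add_task_work = true, the node stays alive until model_return_to_user() runs
the task work, which first drains the queue.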
Best Regards,
Shuai

---------------------------------------------------------------------------------------
dump_stack() is added in:

- __ghes_sdei_callback()
- ghes_proc_in_irq()
- memory_failure_queue_kick()
- memory_failure_work_func()
- memory_failure()

[ 485.457761] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 6.0.0-rc5+ #33
[ 485.457769] Hardware name: xxxx
[ 485.457771] Call trace:
[ 485.457772] dump_backtrace+0xe8/0x12c
[ 485.457779] show_stack+0x20/0x50
[ 485.457781] dump_stack_lvl+0x68/0x84
[ 485.457785] dump_stack+0x18/0x34
[ 485.457787] __ghes_sdei_callback+0x24/0x64
[ 485.457789] ghes_sdei_critical_callback+0x5c/0x94
[ 485.457792] sdei_event_handler+0x28/0x90
[ 485.457795] do_sdei_event+0x74/0x160
[ 485.457797] __sdei_handler+0x60/0xf0
[ 485.457799] __sdei_asm_handler+0xbc/0x18c
[ 485.457801] cpu_do_idle+0x14/0x80
[ 485.457802] default_idle_call+0x50/0x114
[ 485.457804] cpuidle_idle_call+0x16c/0x1c0
[ 485.457806] do_idle+0xb8/0x110
[ 485.457808] cpu_startup_entry+0x2c/0x34
[ 485.457809] secondary_start_kernel+0xf0/0x144
[ 485.457812] __secondary_switched+0xb0/0xb4
[ 485.459513] EDAC MC0: 1 UE multi-symbol chipkill ECC on unknown memory (node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 page:0x89c033 offset:0x400 grain:1 - APEI location: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
[ 485.459523] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 485.470607] {2}[Hardware Error]: event severity: recoverable
[ 485.476252] {2}[Hardware Error]: precise tstamp: 2022-09-29 09:31:27
[ 485.482678] {2}[Hardware Error]: Error 0, type: recoverable
[ 485.488322] {2}[Hardware Error]: section_type: memory error
[ 485.494052] {2}[Hardware Error]: error_status: Storage error in DRAM memory (0x0000000000000400)
[ 485.503081] {2}[Hardware Error]: physical_address: 0x000000089c033400
[ 485.509680] {2}[Hardware Error]: node:0 card:3 module:0 rank:0 bank_group:0 bank_address:0 device:0 row:624 column:384 chip_id:0
[ 485.521487] {2}[Hardware Error]: error_type: 5, multi-symbol chipkill ECC
[ 485.528439] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G E 6.0.0-rc5+ #33
[ 485.528440] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[ 485.528441] Call trace:
[ 485.528441] dump_backtrace+0xe8/0x12c
[ 485.528443] show_stack+0x20/0x50
[ 485.528444] dump_stack_lvl+0x68/0x84
[ 485.528446] dump_stack+0x18/0x34
[ 485.528448] ghes_proc_in_irq+0x220/0x250
[ 485.528450] irq_work_single+0x30/0x80
[ 485.528453] irq_work_run_list+0x4c/0x70
[ 485.528455] irq_work_run+0x28/0x44
[ 485.528457] do_handle_IPI+0x2b4/0x2f0
[ 485.528459] ipi_handler+0x24/0x34
[ 485.528461] handle_percpu_devid_irq+0x90/0x1c4
[ 485.528463] generic_handle_domain_irq+0x34/0x50
[ 485.528465] __gic_handle_irq_from_irqson.isra.0+0x130/0x230
[ 485.528468] gic_handle_irq+0x2c/0x60
[ 485.528469] call_on_irq_stack+0x2c/0x38
[ 485.528471] do_interrupt_handler+0x88/0x90
[ 485.528472] el1_interrupt+0x48/0xb0
[ 485.528475] el1h_64_irq_handler+0x18/0x24
[ 485.528476] el1h_64_irq+0x74/0x78
[ 485.528477] __do_softirq+0xa4/0x358
[ 485.528478] __irq_exit_rcu+0x110/0x13c
[ 485.528479] irq_exit_rcu+0x18/0x24
[ 485.528480] el1_interrupt+0x4c/0xb0
[ 485.528482] el1h_64_irq_handler+0x18/0x24
[ 485.528483] el1h_64_irq+0x74/0x78
[ 485.528484] arch_cpu_idle+0x18/0x40
[ 485.528485] default_idle_call+0x50/0x114
[ 485.528487] cpuidle_idle_call+0x16c/0x1c0
[ 485.528488] do_idle+0xb8/0x110
[ 485.528489] cpu_startup_entry+0x2c/0x34
[ 485.528491] secondary_start_kernel+0xf0/0x144
[ 485.528493] __secondary_switched+0xb0/0xb4
[ 485.528511] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G E 6.0.0-rc5+ #33
[ 485.528513] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[ 485.528514] Workqueue: events memory_failure_work_func
[ 485.528518] Call trace:
[ 485.528519] dump_backtrace+0xe8/0x12c
[ 485.528520] show_stack+0x20/0x50
[ 485.528521] dump_stack_lvl+0x68/0x84
[ 485.528523] dump_stack+0x18/0x34
[ 485.528525] memory_failure_work_func+0xec/0x180
[ 485.528527] process_one_work+0x1f4/0x460
[ 485.528528] worker_thread+0x188/0x3e4
[ 485.528530] kthread+0xd0/0xd4
[ 485.528532] ret_from_fork+0x10/0x20
[ 485.528533] CPU: 3 PID: 12696 Comm: kworker/3:0 Tainted: G E 6.0.0-rc5+ #33
[ 485.528534] Hardware name: AlibabaCloud AliServer-Xuanwu2.0AM-02-2UC1P-5B/M, BIOS 1.2.M1.AL.E.132.01 08/23/2022
[ 485.528535] Workqueue: events memory_failure_work_func
[ 485.528537] Call trace:
[ 485.528538] dump_backtrace+0xe8/0x12c
[ 485.528539] show_stack+0x20/0x50
[ 485.528540] dump_stack_lvl+0x68/0x84
[ 485.528541] dump_stack+0x18/0x34
[ 485.528543] memory_failure+0x50/0x438
[ 485.528544] memory_failure_work_func+0x174/0x180
[ 485.528546] process_one_work+0x1f4/0x460
[ 485.528547] worker_thread+0x188/0x3e4
[ 485.528548] kthread+0xd0/0xd4
[ 485.528550] ret_from_fork+0x10/0x20
[ 485.530622] Memory failure: 0x89c033: recovery action for dirty LRU page: Recovered