Re: [PATCH v2] ACPI: APEI: do not add task_work to kernel thread to avoid memory leak

Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> · Fri, 30 Sep 2022 10:52:44 +0800

在 2022/9/30 AM4:52, Luck, Tony 写道:
> Thanks for your patient explanations.

You are welcome :)

> 
>> STEP2: In IRQ context, ghes_proc/_in_irq() queues memory failure work on current CPU
>> in workqueue and add task work to sync with the workqueue.
> 
> Why is there a difference if the interrupted task was a user task vs. a kernel thread?
> 
> It seems arbitrary. If the error can be handled in the kernel thread case without
> a task_work_add() to the current process, can't all errors be handled this way?

I'm afraid not. The kworker in workqueue is asynchronous with ret_to_user() of the
interrupted task. If we return to user-space before the queued memory_failure() work
is processed, we will take the fault again when the error is signal by synchronous
external abort. This loop may cause platform firmware to exceed some threshold and
reboot.

When a user task consuming poison data, a synchronous external abort will be signaled,
for example "einj_mem_uc single" in ras-tools. In such case, the handling flow will
be like bellow:

----------------------------------STEP 0-------------------------------------------
[ghes_sdei_critical_callback: current einj_mem_uc, local cpu]
ghes_sdei_critical_callback
    => __ghes_sdei_callback
        => ghes_in_nmi_queue_one_entry: peak and read estatus
        => irq_work_queue(&ghes_proc_irq_work) // ghes_proc_in_irq - irq_work
[ghes_sdei_critical_callback: return]
-----------------------------------STEP 1------------------------------------------
[ghes_proc_in_irq: current einj_mem_uc, local cpu]
            => ghes_do_proc
                => ghes_handle_memory_failure
                    => ghes_do_memory_failure
                        => memory_failure_queue	- put work task on a specific cpu
                            => if (kfifo_put(&mf_cpu->fifo, entry))
                                  schedule_work_on(smp_processor_id(), &mf_cpu->work);
            => task_work_add(current, &estatus_node->task_work, TWA_RESUME);
[ghes_proc_in_irq: return]
-----------------------------------STEP 3------------------------------------------
// kworker preempts einj_mem_uc on local cpu due to RESCHED flag
[memory_failure_work_func: current kworker, local cpu]	
     => memory_failure_work_func(&mf_cpu->work)
        => while kfifo_get(&mf_cpu->fifo, &entry);	// until get no work
            => soft/hard offline
------------------------------------STEP 4-----------------------------------------
[ghes_kick_task_work: current einj_mem_uc, other cpu]
                => memory_failure_queue_kick
                    => cancel_work_sync //wait memory_failure_work_func finish
                    => memory_failure_work_func(&mf_cpu->work)
                        => kfifo_get(&mf_cpu->fifo, &entry); // no work here
------------------------------------STEP 5-----------------------------------------
[current einj_mem_uc returned to userspace]
                => Killed by SIGBUS

STEP 4 add a task work to ensure the queued memory_failure() work is processed before
returning to user-space. And the interrupted user will be killed by SIGBUS signal.

If we delete STEP 4, the interrupted user task will return to user space synchronously
and consume the poison data again.

> 
> The current thread likely has nothing to do with the error. Just a matter of chance
> on what is running when the NMI is delivered, right?

Yes, the error is actually handled in workqueue. I think the point is that the
synchronous exception signaled by synchronous external abort must be handled
synchronously, otherwise, it will be signaled again.

Best Regards,
Shuai