在 2022/9/26 PM11:20, Luck, Tony 写道: >>> >>> - if (task_work_pending && current->mm != &init_mm) { >>> + if (task_work_pending && current->mm) { >>> estatus_node->task_work.func = ghes_kick_task_work; >>> estatus_node->task_work_cpu = smp_processor_id(); >>> ret = task_work_add(current, &estatus_node->task_work, > > It seems that you are getting errors reported while running kernel threads. This fix avoids > pointlessly adding "task_work" that will never be processed because kernel threads never > return to user mode. Yes, you are right. > > But maybe something else needs to be done? The code was, and with this fix still is, > taking no action for the error. That doesn't seem right. Sorry, I don't think so. As far as I know, on Arm platform, hardware error can signal exceptions including: - Synchronous External Abort (SEA), e,g. data abort or instruction abort - Asynchronous External Abort (signalled as SErrors), e,g. L2 can generate SError for error responses from the interconnect for a Device or Non-cacheable store - Fault Handling and Error Recovery interrupts: DDR mainline/demand/scrubber error interrupt When the error signals asynchronous exceptions (SError or interrupt), any kind of thread can be interrupted, including kernel thread. Because asynchronous exceptions are signaled in background, the errors are detected outside of the current execution context. The GHES driver always queues work to handle memory failure of a page in memory_failure_queue(). If a kernel thread is interrupted: - Without this fix, the added task_work will never be processed so that the work will not be canceled. - With this fix, the task_work will not be added. In a conclusion, the error will be handled in a kworker with or without this fix. The point of fix is that: - The condition is useless because it is always tree. And I think it is not the original patch intends to do. - The current code leads to memory leaks. The estatus_node will not be freed when task_work is added to a kernel thread. > > Are you injecting errors to create this scenario? Yes, I am injecting error to create such scenario. After 200 injections, the ghes_estatus_pool will run of memory and ghes_in_nmi_queue_one_entry() returns ENOMEM. Finally, a lot of unhandled events may cause platform firmware to exceed some threshold and reboot. > What does your test do? My injection includes two steps: - mmap a device memory for userspace in a driver - inject uc and trigger write like ras-tools does I have opened source the code and you can find here[1]. It's forked from your repo and mainly based on your code :) By the way, do you have any plans to extend ras-tools to Arm platform. And is there a mail-list for ras-tools? I send a mail to add test cases to ras-tools several weeks ago, but no response. May your mailbox regards it as Spam. [1] https://gitee.com/anolis/ras-tools/tree/arm-devel