Hi, Tony, 在 2022/9/27 AM11:50, Shuai Xue 写道: > > > 在 2022/9/26 PM11:20, Luck, Tony 写道: >>>> >>>> - if (task_work_pending && current->mm != &init_mm) { >>>> + if (task_work_pending && current->mm) { >>>> estatus_node->task_work.func = ghes_kick_task_work; >>>> estatus_node->task_work_cpu = smp_processor_id(); >>>> ret = task_work_add(current, &estatus_node->task_work, >> >> It seems that you are getting errors reported while running kernel threads. This fix avoids >> pointlessly adding "task_work" that will never be processed because kernel threads never >> return to user mode. > > Yes, you are right. > >> >> But maybe something else needs to be done? The code was, and with this fix still is, >> taking no action for the error. That doesn't seem right. > > Sorry, I don't think so. As far as I know, on Arm platform, hardware error can signal > exceptions including: > > - Synchronous External Abort (SEA), e,g. data abort or instruction abort > - Asynchronous External Abort (signalled as SErrors), e,g. L2 can generate SError for > error responses from the interconnect for a Device or Non-cacheable store > - Fault Handling and Error Recovery interrupts: DDR mainline/demand/scrubber error interrupt > > When the error signals asynchronous exceptions (SError or interrupt), any kind of thread can > be interrupted, including kernel thread. Because asynchronous exceptions are signaled in > background, the errors are detected outside of the current execution context. > > The GHES driver always queues work to handle memory failure of a page in memory_failure_queue(). > If a kernel thread is interrupted: > > - Without this fix, the added task_work will never be processed so that the work will not > be canceled. > - With this fix, the task_work will not be added. > > In a conclusion, the error will be handled in a kworker with or without this fix. > > The point of fix is that: > > - The condition is useless because it is always tree. And I think it is not the original patch > intends to do. > - The current code leads to memory leaks. The estatus_node will not be freed when task_work is > added to a kernel thread. > > >> >> Are you injecting errors to create this scenario? > > Yes, I am injecting error to create such scenario. After 200 injections, the ghes_estatus_pool > will run of memory and ghes_in_nmi_queue_one_entry() returns ENOMEM. Finally, a lot of unhandled > events may cause platform firmware to exceed some threshold and reboot. > >> What does your test do? > > My injection includes two steps: > > - mmap a device memory for userspace in a driver > - inject uc and trigger write like ras-tools does > > I have opened source the code and you can find here[1]. It's forked from your repo and mainly based > on your code :) > > By the way, do you have any plans to extend ras-tools to Arm platform. And is there a mail-list > for ras-tools? I send a mail to add test cases to ras-tools several weeks ago, but no response. > May your mailbox regards it as Spam. > Thank you for your review, but I am still having problems with this question. Do you have any interest to extend ras-tools to Arm platform? I forked a arm-devel branch[1] from your repo: - port X86 arch specific cases to Arm platform - add some common cases like hugetlb, thread and Arm specific cases like prefetch, strb, etc, which are helpful to test hardware and firmware RAS problems we encountered. I am pleasure to contribute these code to your upstream repo and looking forward to see a more powerful and cross platform tools to inject and debug RAS ability on both X86 and Arm platform. I really appreciate your great work, and look forward to your reply. Thank you. Best Regards, Shuai > [1] https://gitee.com/anolis/ras-tools/tree/arm-devel