On 2022/12/6 AM12:00, Xie XiuQi wrote:
> This series fix some issue for arm64 synchronous External Data Abort.
>
> 1. fix unhandled processor error
> According to the RAS documentation, if we cannot determine the impact
> of the error based on the details of the error when an SEA occurs, the
> process cannot safely continue to run. Therefore, for unhandled error,
> we should signal the system and terminate the process immediately.
>
> 2. improve for handling memory errors
>
> If error happened in current execution context, we need pass
> MF_ACTION_REQUIRED flag to memory_failure(), and if memory_failure()
> recovery failed, we must handle this case, other than ignore it.
>
> ---
> v3: add improve for handing memory errors
> v2: fix compile warning reported by kernel test robot.
>
> Xie XiuQi (4):
>   ACPI: APEI: include missing acpi/apei.h
>   arm64: ghes: fix error unhandling in synchronous External Data Abort
>   arm64: ghes: handle the case when memory_failure recovery failed
>   arm64: ghes: pass MF_ACTION_REQUIRED to memory_failure when sea
>
>  arch/arm64/kernel/acpi.c      |  6 ++++++
>  drivers/acpi/apei/apei-base.c |  5 +++++
>  drivers/acpi/apei/ghes.c      | 31 ++++++++++++++++++++++++-------
>  include/acpi/apei.h           |  1 +
>  include/linux/mm.h            |  2 +-
>  mm/memory-failure.c           | 24 +++++++++++++++++-------
>  6 files changed, 54 insertions(+), 15 deletions(-)
>

Hi XiuQi,

As we discussed, if you want to fix this problem before the new UEFI
version comes out, you need another patch that separates synchronous
error handling into task work when the SEA notification is used. Be
careful not to break error handling for the other notification types.
Reference code is pasted below.

Thank you.

Best Regards,
Shuai

----
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 57cae48ebc1f..1982a5e3fd8c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -445,15 +445,71 @@ static void ghes_kick_task_work(struct callback_head *head)
 	gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len);
 }
 
+/**
+ * struct mce_task_work - for synchronous RAS event
+ *
+ * @twork: callback_head for task work
+ * @pfn: page frame number of corrupted page
+ * @flags: fine tune action taken
+ *
+ * Structure to pass task work to be handled before
+ * ret_to_user via task_work_add().
+ */
+struct mce_task_work {
+	struct callback_head twork;
+	u64 pfn;
+	int flags;
+};
+
+static void memory_failure_cb(struct callback_head *twork)
+{
+	int rc;
+	struct mce_task_work *twcb =
+		container_of(twork, struct mce_task_work, twork);
+
+	rc = memory_failure(twcb->pfn, twcb->flags);
+	kfree(twcb);
+
+	if (!rc)
+		return;
+	/*
+	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+	 * to the current process with the proper error info,
+	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+	 *
+	 * In both cases, no further processing is required.
+	 */
+	if (rc == -EHWPOISON || rc == -EOPNOTSUPP)
+		return;
+
+	pr_err("Memory error not recovered\n");
+	force_sig(SIGBUS);
+}
+
 static bool ghes_do_memory_failure(u64 physical_addr, int flags)
 {
 	unsigned long pfn;
+	struct mce_task_work *twcb;
 
 	if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
 		return false;
 
 	pfn = PHYS_PFN(physical_addr);
-	memory_failure_queue(pfn, flags);
+
+	if (flags == MF_ACTION_REQUIRED && current->mm) {
+		twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC);
+		if (!twcb)
+			return false;
+
+		twcb->pfn = pfn;
+		twcb->flags = flags;
+		init_task_work(&twcb->twork, memory_failure_cb);
+		task_work_add(current, &twcb->twork, TWA_RESUME);
+		return false;
+	} else {
+		memory_failure_queue(pfn, flags);
+	}
+
 	return true;
 }
 
-- 
2.20.1.12.g72788fdb