On 2022/10/29 at 1:08 AM, Rafael J. Wysocki wrote:
> On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> wrote:
>>
>> There are two major types of uncorrected error (UC) :
>>
>> - Action Required: The error is detected and the processor already consumes the
>>   memory. OS requires to take action (for example, offline failure page/kill
>>   failure thread) to recover this uncorrectable error.
>>
>> - Action Optional: The error is detected out of processor execution context.
>>   Some data in the memory are corrupted. But the data have not been consumed.
>>   OS is optional to take action to recover this uncorrectable error.
>>
>> For X86 platforms, we can easily distinguish between these two types
>> based on the MCA Bank. While for arm64 platform, the memory failure
>> flags for all UCs which severity are GHES_SEV_RECOVERABLE are set as 0,
>> a.k.a, Action Optional now.
>>
>> If UC is detected by a background scrubber, it is obviously an Action
>> Optional error. For other errors, we should conservatively regard them
>> as Action Required.
>>
>> cper_sec_mem_err::error_type identifies the type of error that occurred
>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set memory failure flags as 0
>> for Scrub Uncorrected Error (type 14). Otherwise, set memory failure
>> flags as MF_ACTION_REQUIRED.
>>
>> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
>
> I need input from the APEI reviewers on this.
>
> Thanks!

Hi, Rafael,

Sorry, I missed this email. Thank you for your quick reply. Let's discuss
this with the reviewers.

Thank you.

Cheers,
Shuai

>
>> ---
>>  drivers/acpi/apei/ghes.c | 10 ++++++++--
>>  include/linux/cper.h     |  3 +++
>>  2 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 80ad530583c9..6c03059cbfc6 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>  	if (sec_sev == GHES_SEV_CORRECTED &&
>>  	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>  		flags = MF_SOFT_OFFLINE;
>> -	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> -		flags = 0;
>> +	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
>> +		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
>> +			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
>> +					0 :
>> +					MF_ACTION_REQUIRED;
>> +		else
>> +			flags = MF_ACTION_REQUIRED;
>> +	}
>>
>>  	if (flags != -1)
>>  		return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index eacb7dd7b3af..b77ab7636614 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -235,6 +235,9 @@ enum {
>>  #define CPER_MEM_VALID_BANK_ADDRESS		0x100000
>>  #define CPER_MEM_VALID_CHIP_ID			0x200000
>>
>> +#define CPER_MEM_SCRUB_CE			13
>> +#define CPER_MEM_SCRUB_UC			14
>> +
>>  #define CPER_MEM_EXT_ROW_MASK			0x3
>>  #define CPER_MEM_EXT_ROW_SHIFT			16
>>
>> --
>> 2.20.1.9.gb50a0d7
>>
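
For reviewers who want to check the decision table in isolation, here is a
minimal userspace sketch of the flag selection for the recoverable case, as
I read the quoted hunk. It is not kernel code: the struct mem_err_sketch type
and the recoverable_uc_flags() helper are hypothetical names made up for this
illustration, and the constant values are local stand-ins that should be
checked against include/linux/cper.h and include/linux/mm.h. The nested
if/else from the patch is collapsed into a single condition, which should be
equivalent.

```c
#include <stdint.h>

/* Local stand-ins for illustration; the real definitions live in kernel headers. */
#define CPER_MEM_VALID_ERROR_TYPE	0x4000	/* assumed value, verify in cper.h */
#define CPER_MEM_SCRUB_UC		14	/* Scrub Uncorrected Error, from the patch */
#define MF_ACTION_REQUIRED		(1 << 1)	/* assumed value, verify in mm.h */

/* Hypothetical, trimmed-down stand-in for struct cper_sec_mem_err. */
struct mem_err_sketch {
	uint64_t validation_bits;
	uint8_t error_type;
};

/*
 * Flags for a UC whose sev and sec_sev are both GHES_SEV_RECOVERABLE:
 * only an error positively identified as scrubber-detected is treated as
 * Action Optional (0); everything else, including records whose error_type
 * is not marked valid, is conservatively treated as Action Required.
 */
int recoverable_uc_flags(const struct mem_err_sketch *mem_err)
{
	if ((mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE) &&
	    mem_err->error_type == CPER_MEM_SCRUB_UC)
		return 0;

	return MF_ACTION_REQUIRED;
}
```

Note that a record carrying error_type 14 without CPER_MEM_VALID_ERROR_TYPE
set still ends up as MF_ACTION_REQUIRED, since the field cannot be trusted
in that case.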