When a memory error, CPU error, PCIe error, or other type of hardware error
that's covered by RAS occurs, firmware should populate the shared GHES memory
location with the proper GHES structures to notify the OS of the error.
For example, platforms that implement firmware-first handling may implement
separate GHES sources for corrected errors and uncorrected errors. If the
error is uncorrectable, the firmware notifies the OS immediately, since the
error needs to be handled as soon as possible. The OS can then take the
appropriate action, such as offlining a page. If the error is corrected, the
firmware does not interrupt the OS. Instead, the OS sees and reports the
error the next time its GHES timer expires. The kernel first parses the GHES
structures and reports the errors through the kernel logs, and then notifies
user space through RAS trace events. This allows user space applications such
as RAS Daemon to see the errors and report them however the user desires.
This patchset extends the kernel functionality for RAS errors based on
updates in the UEFI 2.6 and ACPI 6.1 specifications.

An example flow from firmware to user space could be:

                +---------------+
      +-------->|               |
      |         | GHES polling  |--+
+-------------+ | source        |  |   +---------------+   +------------+
|             | +---------------+  |   | Kernel GHES   |   |            |
|  Firmware   |                    +-->| CPER AER and  |-->| RAS trace  |
|             | +---------------+  |   | EDAC drivers  |   | event      |
+-------------+ |               |  |   +---------------+   +------------+
      |         | GHES sci      |--+
      +-------->| source        |
                +---------------+

Add support for Generic Hardware Error Source (GHES) v2, which introduces the
capability for the OS to acknowledge the consumption of the error record
generated by the Reliability, Availability and Serviceability (RAS)
controller. This eliminates potential race conditions between the OS and the
RAS controller.
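
As a rough illustration of the GHESv2 acknowledgment described above, the
following user-space sketch shows one way the read-modify-write of the
acknowledge register could look. It is not the code from this series: the
structure ack_desc and the helpers ack_value() and ack_error() are
hypothetical names, the register is modelled as a plain in-memory word, and
the bit offset is assumed to be smaller than 64.

```c
#include <stdint.h>

/*
 * Hypothetical, simplified stand-in for the GHESv2 read-ack fields.
 * The names are illustrative and do not match the ACPICA structures.
 */
struct ack_desc {
	uint64_t preserve;   /* mask of register bits that must be kept */
	uint64_t write;      /* bits to set to signal "record consumed" */
	unsigned bit_offset; /* offset of the field in the register, < 64 */
};

/* Compute the value to write back, given the register's current contents. */
static uint64_t ack_value(uint64_t current, const struct ack_desc *d)
{
	uint64_t val = current;

	val &= d->preserve << d->bit_offset; /* keep only the preserved bits */
	val |= d->write << d->bit_offset;    /* set the acknowledge bits */
	return val;
}

/* Read-modify-write of an acknowledge register (modelled in memory here). */
static void ack_error(volatile uint64_t *reg, const struct ack_desc *d)
{
	*reg = ack_value(*reg, d);
}
```

In this sketch the OS would call ack_error() only after it has finished
reading the error status block, so that firmware does not reuse the shared
memory while the record is still being consumed.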
Add support for the timestamp field added to the Generic Error Data Entry v3,
allowing the OS to log the time that the error was generated by the firmware,
rather than the time the error is consumed. This improves the correctness of
event sequences when analyzing error logs. The timestamp was added in
ACPI 6.1; see Table 18-343, Generic Error Data Entry.

Add support for the ARMv8 Common Platform Error Record (CPER) per the
UEFI 2.6 specification. ARMv8-specific processor error information is
reported as part of the CPER records. This provides more detail for
processor error logs and can help describe ARMv8 cache, TLB, and bus errors.

Synchronous External Abort (SEA) represents a specific processor error
condition in ARM systems. A handler is added to recognize SEA errors, and a
notifier is added to parse and report the errors before the process is
killed. Refer to section N.2.1.1 in the Common Platform Error Record appendix
of the UEFI 2.6 specification.

Currently the kernel ignores CPER records that are unrecognized. On the other
hand, the UEFI spec allows for non-standard (e.g. vendor proprietary) error
section types in CPER (Common Platform Error Record), as defined in section
N.2.3 of UEFI version 2.5. As a result, users cannot see hardware error data
from non-standard sections. If the Section Type field of a Generic Error Data
Entry is unrecognized, print the raw data to the dmesg buffer and add a
tracepoint for reporting such hardware errors.

Currently, even if an error status block's severity is fatal, the kernel
neither honors the severity level nor panics. With the firmware-first model,
the platform could inform the OS about a fatal hardware error through a
non-NMI GHES notification type. The OS should panic when a hardware error
record is received with this severity.

Add support to handle SEAs that occur while a KVM guest kernel is running.
Currently these are unsupported by the guest abort handling.
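
To make the Generic Error Data Entry v3 timestamp handling mentioned above
more concrete, here is a small user-space sketch of decoding the 8-byte
field. It is an illustration only, not the code in this series. It assumes
the byte layout as I read it from ACPI 6.1 (bytes 0-2 hold seconds, minutes
and hours; bit 0 of byte 3 marks the timestamp as precise; bytes 4-7 hold
day, month, year and century; each value is packed BCD). The names
gdata_tstamp, bcd_to_bin(), decode_tstamp() and format_tstamp() are
hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Decoded form of the 8-byte timestamp; field names are illustrative. */
struct gdata_tstamp {
	unsigned year;  /* full year: century * 100 + year */
	unsigned mon, day, hour, min, sec;
	int precise;    /* nonzero if bit 0 of byte 3 is set */
};

/* Convert one packed BCD byte (two decimal digits) to binary. */
static unsigned bcd_to_bin(uint8_t b)
{
	return (b >> 4) * 10u + (b & 0x0fu);
}

/* Decode the raw bytes, assuming the layout described in the lead-in. */
static struct gdata_tstamp decode_tstamp(const uint8_t ts[8])
{
	struct gdata_tstamp t;

	t.sec     = bcd_to_bin(ts[0]);
	t.min     = bcd_to_bin(ts[1]);
	t.hour    = bcd_to_bin(ts[2]);
	t.precise = ts[3] & 0x1;
	t.day     = bcd_to_bin(ts[4]);
	t.mon     = bcd_to_bin(ts[5]);
	t.year    = bcd_to_bin(ts[7]) * 100u + bcd_to_bin(ts[6]);
	return t;
}

/* Format as "<precise|imprecise> YYYY-MM-DD HH:MM:SS" in a single print. */
static int format_tstamp(char *buf, size_t len, const struct gdata_tstamp *t)
{
	return snprintf(buf, len, "%s %04u-%02u-%02u %02u:%02u:%02u",
			t->precise ? "precise" : "imprecise",
			t->year, t->mon, t->day, t->hour, t->min, t->sec);
}
```

Keeping the century and year as separate BCD bytes, and combining them only
when printing, mirrors the "year and century" split noted in the V2
changelog entry.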

V17:Rebase on tip
    Change trace event helper function names
    Remove unneeded prefixes from commit text

V16:Rebase on 4.11
    Change helper functions from #defines to inline functions
    Address checkpatch warnings which make sense
    Various parameter/variable name changes and spacing changes for better
     code readability
    Comment why only GHESv2 needs to acknowledge the error records
    Define and set error structures on the same line
    Change timestamp printing function name to cper_print_tstamp
    Update timestamp to be a single print again and specify when it's
     precise or imprecise
    Only print section length when the length check fails in the ARM CPER
     record parsing
    Remove version and length print of ARM error info structures
    Spell out Multiprocessor Affinity Register (MPIDR) and Power State
     Coordination Interface (PSCI)
    Combine invalid context prints for ARM context info parsing
    Only print register context type as a string
    ifdef around the CPER ARM code to fix x86 compilation failure and
     remove the enabled check from the if statement
    Use BIT() for CPER ARM #defines
    Only call nmi_enter/exit when interrupts are enabled
    Move GHES panic code into a single function
    Add trace function prototypes in ras.c/ras.h to avoid ifdefs
    Change UUID definition to guarantee we don't overflow the u8 array
    Comment what ghes_notify_sea return means

V15:Rebase on 4.11-rc7
    Use wrapper functions for [un]mapping kernel acknowledgment register
    Spacing and name changes to make code cleaner
    Break up timestamp print to be more readable
    Break generic error data v3 structure handling code into separate
     patch and have timestamp handling in its own patch
    Put ARM CPER handling into ifdef for ARM systems
    Add braces and missing space to KVM patch

V14:Make sure function prototypes are in the __ASSEMBLY__ block
    Change is_abort_synchronous to is_abort_sea
    Use phys_addr_t for SEA address
    Return after successful SEA handling in handle_guest_abort()

V13:Rebase on 4.11-rc2
    Print decimal and hex sizes for unknown CPER section errors
    Use proper CONFIG_* when using IS_ENABLED
    Move handle_guest_sea call prior to SEI check
    Add a return value to handle_guest_sea
    Move RCU locking into ghes_notify_sea
    Add valid bit checks to ARM trace event
    Remove GPIO, SEI, and GSIV cases in GHES
    Add ARCH_HAVE_NMI_SAFE_CMPXCHG since we added NMI usage

V12:Remove double quotes from CPER code
    Add helper function to check all SEA cases in KVM patch
    Replace nmi_enter/exit with rcu_read_lock/unlock for KVM SEA
    Change HAVE_ACPI_APEI_SEA to ACPI_APEI_SEA in KVM SEA case

V11:Change print_hex_dump calls to include ASCII output
    Change HAVE_ACPI_APEI_SEA to ACPI_APEI_SEA and make it 'default y'
    Add unknown print back when printing unknown CPER section
    Make sure to use "%s" in CPER code
    Spacing fix when checking if SEA is enabled

V10:Fix spacing of trace event enabled if statement

V9: Move SEA_FnV_MASK to ESR_ELx_FnV
    Move HAVE_NMI into alphabetical order
    Remove duplicate hardirq.h include
    Only call ghes_notify_sea if HAVE_ACPI_APEI_SEA
    Make ACPI_APEI_SEA depend on ACPI_APEI_GHES
    Use phys_addr_t for physical address variable
    Make ghes_sea_add() return void
    Add include guard to ghes.h
    Verify HAVE_RAS before calling ras trace events
    Call __ghes_print_estatus() before __ghes_call_panic()
    Add trace_*_event_enabled() checks for both new trace events

V8: Remove SEA notifier
    Add FAR not valid bit check when populating the SEA error address
    Move nmi_enter/exit() to architecture specific code
    Add synchronize_rcu() usage to SEA handling
    Make GHES_IOREMAP_PAGES always 2
    Update ghes_ioremap_pfn_nmi() to work like ghes_ioremap_pfn_irq()
    Remove the SEA print from handle_guest_sea()

V7: Update a couple prints for ARM processor errors
    Add print notifying if overflow occurred for ARM processor errors
    Check for ARM configuration to allow the compiler to ignore ARM code
     on non-ARM systems
    Use SEA acronym instead of spelling it out
    Update fault_info prints to be more clear
    Add NMI locking to SEA notification
    Remove error info structure from ARM trace event since there can be
     a variable amount of these structures

V6: Change HEST_TYPE_GENERIC_V2 to IS_HEST_TYPE_GENERIC_V2 for readability
    Move APEI helper defines from cper.h to ghes.h
    Add data_len decrement back into print loop
    Change references to ARMv8 to just ARM
    Rewrite ARM processor context info parsing
    Check valid bit of ARM error info field before printing it
    Add include of linux/uuid.h in ghes.c

V5: Fix GHES goto logic for error conditions
    Change ghes_do_read_ack to ghes_ack_error
    Make sure data version check is >= 3
    Use CPER helper functions in print functions
    Make handle_guest_sea() dummy function static for arm
    Add arm to subject line for KVM patch

V4: Add bit offset left shift to read_ack_write value
    Make HEST generic and generic_v2 structures a union in the ghes
     structure
    Move gdata v3 helper functions into ghes.h to avoid duplication
    Reorder the timestamp print and avoid memcpy
    Add helper functions for gdata size checking
    Rename the SEA functions
    Add helper function for GHES panics
    Set fru_id to NULL UUID at variable declaration
    Limit ARM trace event parameters to the needed structures
    Reorder the ARM trace event variables to save space
    Add comment for why we don't pass SEAs to the guest when it aborts
    Move ARM trace event call into GHES driver instead of CPER

V3: Fix unmapped address to the read_ack_register in ghes.c
    Add helper function to get the proper payload based on generic data
     entry version
    Move timestamp print to avoid changing function calls in cper.c
    Remove patch "arm64: exception: handle instruction abort at current EL"
     since the el1_ia handler is already added in 4.8
    Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
    Add a new trace event for ARM type errors
    Add support to handle KVM guest SEAs

V2: Add PSCI state print for the ARMv8 error type.
    Separate timestamp year into year and century using BCD format.
    Rebase on top of ACPICA 20160318 release and remove header file changes
     in include/acpi/actbl1.h.
    Add panic OS with fatal error status block patch.
    Add processing of unrecognized CPER error section patches with updates
     from previous comments. Original patches:
     https://lkml.org/lkml/2015/9/8/646

V1: https://lkml.org/lkml/2016/2/5/544

Jonathan (Zhixiong) Zhang (1):
  acpi: apei: panic OS with fatal error status block

Tyler Baicar (10):
  acpi: apei: read ack upon ghes record consumption
  ras: acpi/apei: cper: add support for generic data v3 structure
  cper: add timestamp print to CPER status printing
  efi: parse ARM processor error
  arm64: exception: handle Synchronous External Abort
  acpi: apei: handle SEA notification type for ARMv8
  efi: print unrecognized CPER section
  ras: acpi / apei: generate trace event for unrecognized CPER section
  trace, ras: add ARM processor error trace event
  arm/arm64: KVM: add guest SEA support

 arch/arm/include/asm/kvm_arm.h       |  10 ++
 arch/arm/include/asm/system_misc.h   |   5 +
 arch/arm64/Kconfig                   |   2 +
 arch/arm64/include/asm/esr.h         |   1 +
 arch/arm64/include/asm/kvm_arm.h     |  10 ++
 arch/arm64/include/asm/system_misc.h |   2 +
 arch/arm64/mm/fault.c                |  80 +++++++++++--
 drivers/acpi/apei/Kconfig            |  15 +++
 drivers/acpi/apei/ghes.c             | 212 +++++++++++++++++++++++++++++------
 drivers/acpi/apei/hest.c             |   7 +-
 drivers/firmware/efi/cper.c          | 204 ++++++++++++++++++++++++++++++---
 drivers/ras/ras.c                    |  16 ++-
 include/acpi/ghes.h                  |  48 +++++++-
 include/linux/cper.h                 |  54 +++++++++
 include/linux/ras.h                  |  15 +++
 include/ras/ras_event.h              |  90 +++++++++++++++
 include/uapi/linux/uuid.h            |   6 +-
 virt/kvm/arm/mmu.c                   |  36 +++++-
 18 files changed, 743 insertions(+), 70 deletions(-)

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project.