Re: [PATCH V6 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64

Hello Shiju,

Great! Thank you for testing! :)

Tyler

On 12/13/2016 4:10 AM, Shiju Jose wrote:
Hi Tyler,

We have tested V6 patch set on our platform. It worked fine.

Thanks,
Shiju

-----Original Message-----
From: Tyler Baicar [mailto:tbaicar@xxxxxxxxxxxxxx]
Sent: 07 December 2016 21:48
To: christoffer.dall@xxxxxxxxxx; marc.zyngier@xxxxxxx;
pbonzini@xxxxxxxxxx; rkrcmar@xxxxxxxxxx; linux@xxxxxxxxxxxxxxx;
catalin.marinas@xxxxxxx; will.deacon@xxxxxxx; rjw@xxxxxxxxxxxxx;
lenb@xxxxxxxxxx; matt@xxxxxxxxxxxxxxxxxxx; robert.moore@xxxxxxxxx;
lv.zheng@xxxxxxxxx; nkaje@xxxxxxxxxxxxxx; zjzhang@xxxxxxxxxxxxxx;
mark.rutland@xxxxxxx; james.morse@xxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
eun.taik.lee@xxxxxxxxxxx; sandeepa.s.prabhu@xxxxxxxxx;
labbott@xxxxxxxxxx; shijie.huang@xxxxxxx; rruigrok@xxxxxxxxxxxxxx;
paul.gortmaker@xxxxxxxxxxxxx; tn@xxxxxxxxxxxx; fu.wei@xxxxxxxxxx;
rostedt@xxxxxxxxxxx; bristot@xxxxxxxxxx;
linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; kvmarm@xxxxxxxxxxxxxxxxxxxxx;
kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
linux-acpi@xxxxxxxxxxxxxxx; linux-efi@xxxxxxxxxxxxxxx; devel@xxxxxxxxxx;
Suzuki.Poulose@xxxxxxx; punit.agrawal@xxxxxxx; astone@xxxxxxxxxx;
harba@xxxxxxxxxxxxxx; hanjun.guo@xxxxxxxxxx; John Garry; Shiju Jose
Cc: Tyler Baicar
Subject: [PATCH V6 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on
ARM64

When a memory error, CPU error, PCIe error, or other type of hardware
error that's covered by RAS occurs, firmware should populate the shared
GHES memory location with the proper GHES structures to notify the OS
of the error.
For example, platforms that implement firmware first handling may
implement separate GHES sources for corrected errors and uncorrected
errors. If the error is an uncorrectable error, then the firmware will
notify the OS immediately since the error needs to be handled ASAP. The
OS will then be able to take the appropriate action needed such as
offlining a page. If the error is a corrected error, then the firmware
will not interrupt the OS immediately.
Instead, the OS will see and report the error the next time its GHES
timer expires. The kernel will first parse the GHES structures and
report the errors through the kernel logs and then notify the user
space through RAS trace events. This allows user space applications
such as RAS Daemon to see the errors and report them however the user
desires. This patchset extends the kernel functionality for RAS errors
based on updates in the UEFI 2.6 and ACPI 6.1 specifications.

An example flow from firmware to user space could be:

                  +---------------+
        +-------->|               |
        |         |  GHES polling |--+
+-------------+   |    source     |  |   +---------------+   +------------+
|             |   +---------------+  |   |  Kernel GHES  |   |            |
|  Firmware   |                      +-->|  CPER AER and |-->|  RAS trace |
|             |   +---------------+  |   |  EDAC drivers |   |   event    |
+-------------+   |               |  |   +---------------+   +------------+
        |         |  GHES sci     |--+
        +-------->|   source      |
                  +---------------+

Add support for Generic Hardware Error Source (GHES) v2, which
introduces the capability for the OS to acknowledge the consumption of
the error record generated by the Reliability, Availability and
Serviceability (RAS) controller.
This eliminates potential race conditions between the OS and the RAS
controller.
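
As a rough illustration of the acknowledgment step, the sketch below
computes the value the OS would write back to the read-ack register:
keep the bits covered by the preserve mask, then set the write bits,
both shifted by the register's bit offset. The struct and function
names are hypothetical and simplified for illustration; they are not
the definitions used in this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the GHESv2 read-ack fields. */
struct read_ack_desc {
	uint8_t  bit_offset; /* position of the ack field in the register */
	uint64_t preserve;   /* mask of bits that must keep their value */
	uint64_t write;      /* bits to set to signal "record consumed" */
};

/* Compute the value to write back, given the current register contents. */
static uint64_t read_ack_value(const struct read_ack_desc *d, uint64_t current)
{
	assert(d->bit_offset < 64); /* shifting by >= 64 would be undefined */

	uint64_t val = current & (d->preserve << d->bit_offset);

	val |= d->write << d->bit_offset;
	return val;
}
```

With a bit offset of 4, a preserve mask of 0xE and a write value of
0x1, a register that currently reads 0xFF would be written back as
0xF0.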

Add support for the timestamp field added to the Generic Error Data
Entry v3, allowing the OS to log the time that the error is generated
by the firmware, rather than the time the error is consumed. This
improves the correctness of event sequences when analyzing error logs.
The timestamp was added in ACPI 6.1; see Table 18-343, Generic Error
Data Entry.
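
A minimal sketch of decoding such a timestamp follows. It assumes an
8-byte field in packed BCD laid out as seconds, minutes, hours, a flags
byte, day, month, year, and century; that layout is an assumption here
and should be checked against the table referenced above. The struct
and function names are hypothetical and are not taken from the patches.

```c
#include <stdint.h>

/* Decoded form of the timestamp; a hypothetical type for this sketch. */
struct decoded_time {
	unsigned int year, month, day, hour, min, sec;
};

/* Convert one packed-BCD byte (e.g. 0x59) to its binary value (59). */
static unsigned int bcd_to_bin(uint8_t v)
{
	return (v & 0x0f) + (v >> 4) * 10;
}

/*
 * Assumed byte layout:
 *   [0] seconds  [1] minutes  [2] hours  [3] flags (ignored here)
 *   [4] day      [5] month    [6] year   [7] century
 */
static struct decoded_time decode_timestamp(const uint8_t ts[8])
{
	struct decoded_time t;

	t.sec   = bcd_to_bin(ts[0]);
	t.min   = bcd_to_bin(ts[1]);
	t.hour  = bcd_to_bin(ts[2]);
	t.day   = bcd_to_bin(ts[4]);
	t.month = bcd_to_bin(ts[5]);
	/* Year and century are separate bytes, so combine them. */
	t.year  = bcd_to_bin(ts[7]) * 100 + bcd_to_bin(ts[6]);
	return t;
}
```

Keeping the century in its own byte is what allows a full four-digit
year to be reconstructed from two BCD bytes.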

Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
specification. ARMv8 specific processor error information is reported
as part of the CPER records. This provides more detail in processor
error logs and can help describe ARMv8 cache, TLB, and bus errors.

Synchronous External Abort (SEA) represents a specific processor error
condition in ARM systems. A handler is added to recognize SEA errors,
and a notifier is added to parse and report the errors before the
process is killed. Refer to section N.2.1.1 in the Common Platform
Error Record appendix of the UEFI 2.6 specification.

Currently the kernel ignores CPER records that are unrecognized.
However, the UEFI spec allows non-standard (e.g. vendor-proprietary)
error section types in CPER (Common Platform Error Record), as defined
in section N2.3 of UEFI version 2.5. As a result, the user cannot see
hardware error data from non-standard sections.

If the Section Type field of a Generic Error Data Entry is
unrecognized, print the raw data to the dmesg buffer and add a
tracepoint for reporting such hardware errors.

Currently, even if an error status block's severity is fatal, the kernel
does not honor the severity level and does not panic. With the firmware first
model, the platform could inform the OS about a fatal hardware error
through the non-NMI GHES notification type. The OS should panic when a
hardware error record is received with this severity.
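
The intended policy can be sketched as a simple mapping from the error
status block severity to an action. The severity values follow the CPER
encoding (0 recoverable, 1 fatal, 2 corrected, 3 informational); the
action enum and the function name are hypothetical, and treating an
unknown severity as fatal is a conservative assumption made for this
sketch only.

```c
#include <stdint.h>

/* Severity encoding of a CPER error status block. */
enum cper_severity {
	CPER_SEV_RECOVERABLE   = 0,
	CPER_SEV_FATAL         = 1,
	CPER_SEV_CORRECTED     = 2,
	CPER_SEV_INFORMATIONAL = 3,
};

/* Hypothetical actions, named only for this sketch. */
enum sketch_action {
	SKETCH_LOG_ONLY,     /* report through logs and trace events */
	SKETCH_TRY_RECOVERY, /* e.g. offline the affected page */
	SKETCH_PANIC,        /* stop the system */
};

/* Pick an action from the severity, independent of notification type. */
static enum sketch_action action_for_severity(uint32_t severity)
{
	switch (severity) {
	case CPER_SEV_INFORMATIONAL:
	case CPER_SEV_CORRECTED:
		return SKETCH_LOG_ONLY;
	case CPER_SEV_RECOVERABLE:
		return SKETCH_TRY_RECOVERY;
	case CPER_SEV_FATAL:
	default:
		/* Assumption: an unknown severity is handled like fatal. */
		return SKETCH_PANIC;
	}
}
```

The point of the sketch is that the decision depends only on the
severity, so a fatal record delivered through a non-NMI notification
still leads to a panic.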

Add support to handle SEAs that occur while a KVM guest kernel is
running. Currently these are unsupported by the guest abort handling.

Depends on: [PATCH v15] acpi, apei, arm64: APEI initial support for
aarch64.
             https://lkml.org/lkml/2016/12/1/312

V6: Change HEST_TYPE_GENERIC_V2 to IS_HEST_TYPE_GENERIC_V2 for
      readability
     Move APEI helper defines from cper.h to ghes.h
     Add data_len decrement back into print loop
     Change references to ARMv8 to just ARM
     Rewrite ARM processor context info parsing
     Check valid bit of ARM error info field before printing it
     Add include of linux/uuid.h in ghes.c

V5: Fix GHES goto logic for error conditions
     Change ghes_do_read_ack to ghes_ack_error
     Make sure data version check is >= 3
     Use CPER helper functions in print functions
     Make handle_guest_sea() dummy function static for arm
     Add arm to subject line for KVM patch

V4: Add bit offset left shift to read_ack_write value
     Make HEST generic and generic_v2 structures a union in the ghes
      structure
     Move gdata v3 helper functions into ghes.h to avoid duplication
     Reorder the timestamp print and avoid memcpy
     Add helper functions for gdata size checking
     Rename the SEA functions
     Add helper function for GHES panics
     Set fru_id to NULL UUID at variable declaration
     Limit ARM trace event parameters to the needed structures
     Reorder the ARM trace event variables to save space
     Add comment for why we don't pass SEAs to the guest when it aborts
     Move ARM trace event call into GHES driver instead of CPER

V3: Fix unmapped address to the read_ack_register in ghes.c
     Add helper function to get the proper payload based on generic data
      entry version
     Move timestamp print to avoid changing function calls in cper.c
     Remove patch "arm64: exception: handle instruction abort at current
      EL" since the el1_ia handler is already added in 4.8
     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
     Add a new trace event for ARM type errors
     Add support to handle KVM guest SEAs

V2: Add PSCI state print for the ARMv8 error type.
     Separate timestamp year into year and century using BCD format.
     Rebase on top of ACPICA 20160318 release and remove header file
      changes in include/acpi/actbl1.h.
     Add panic OS with fatal error status block patch.
     Add processing of unrecognized CPER error section patches with
      updates from previous comments. Original patches:
      https://lkml.org/lkml/2015/9/8/646

V1: https://lkml.org/lkml/2016/2/5/544

Jonathan (Zhixiong) Zhang (1):
   acpi: apei: panic OS with fatal error status block

Tyler Baicar (9):
   acpi: apei: read ack upon ghes record consumption
   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
   efi: parse ARM processor error
   arm64: exception: handle Synchronous External Abort
   acpi: apei: handle SEA notification type for ARMv8
   efi: print unrecognized CPER section
   ras: acpi / apei: generate trace event for unrecognized CPER section
   trace, ras: add ARM processor error trace event
   arm/arm64: KVM: add guest SEA support

  arch/arm/include/asm/kvm_arm.h       |   1 +
  arch/arm/include/asm/system_misc.h   |   5 +
  arch/arm/kvm/mmu.c                   |  18 +++-
  arch/arm64/Kconfig                   |   1 +
  arch/arm64/include/asm/kvm_arm.h     |   1 +
  arch/arm64/include/asm/system_misc.h |  15 +++
  arch/arm64/mm/fault.c                |  71 ++++++++++--
  drivers/acpi/apei/Kconfig            |  14 +++
  drivers/acpi/apei/ghes.c             | 189 +++++++++++++++++++++++++++++---
  drivers/acpi/apei/hest.c             |   7 +-
  drivers/firmware/efi/cper.c          | 204 ++++++++++++++++++++++++++++++++---
  drivers/ras/ras.c                    |   2 +
  include/acpi/ghes.h                  |  27 ++++-
  include/linux/cper.h                 |  53 +++++++++
  include/ras/ras_event.h              | 100 +++++++++++++++++
  15 files changed, 664 insertions(+), 44 deletions(-)

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
Linux Foundation Collaborative Project.




