Adds documentation for Ampere(R)'s Altra(R) SMpro errmon driver. Signed-off-by: Thu Nguyen <thu@xxxxxxxxxxxxxxxxxxxxxx> Signed-off-by: Quan Nguyen <quan@xxxxxxxxxxxxxxxxxxxxxx> --- Changes in v8: + Update to reflect single value per sysfs [Quan] Changes in v7: + None Changes in v6: + First introduced in v6 [Quan] Documentation/misc-devices/index.rst | 1 + Documentation/misc-devices/smpro-errmon.rst | 198 ++++++++++++++++++++ 2 files changed, 199 insertions(+) create mode 100644 Documentation/misc-devices/smpro-errmon.rst diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst index 30ac58f81901..7a6a6263cbab 100644 --- a/Documentation/misc-devices/index.rst +++ b/Documentation/misc-devices/index.rst @@ -26,6 +26,7 @@ fit into other categories. lis3lv02d max6875 pci-endpoint-test + smpro-errmon spear-pcie-gadget uacce xilinx_sdfec diff --git a/Documentation/misc-devices/smpro-errmon.rst b/Documentation/misc-devices/smpro-errmon.rst new file mode 100644 index 000000000000..53599904da70 --- /dev/null +++ b/Documentation/misc-devices/smpro-errmon.rst @@ -0,0 +1,198 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +Kernel driver Ampere(R)'s Altra(R) SMpro errmon +=============================================== + +Supported chips: + + * Ampere(R) Altra(R) + + Prefix: 'smpro' + + Preference: Altra SoC BMC Interface Specification + +Author: Thu Nguyen <thu@xxxxxxxxxxxxxxxxxxxxxx> + +Description +----------- + +This driver supports hardware monitoring for Ampere(R) Altra(R) SoC's based on the +SMpro co-processor (SMpro). +The following SoC alert/event types are supported by the errmon driver: + +* Core CE/UE errors +* Memory CE/UE errors +* PCIe CE/UE errors +* Other CE/UE errors +* Internal SMpro/PMpro errors +* VRD hot +* VRD warn/fault +* DIMM Hot + +The SMpro interface provides the registers to query the status of the SoC alerts/events +and their data and export to userspace by this driver. + +Usage Notes +----------- + +SMpro errmon driver creates the sysfs files for each host alert/event type. +Example: ``error_core_ce`` to get Core CE error type. + +To get a host alert/event type, the user will read the corresponding sysfs file. + +* If the alert/event is absented, the sysfs file returns empty. +* If the alerts/events are presented, one each read to the sysfs, the oldest alert/event will be reported until all the errors are read out.. + +The format of the error lines is defended on the alert/event type. + +1) Type 1 for Core/Memory/PCIe/Other CE/UE alert types:: + + <Error Type><Error SubType><Instance><Error Status><Error Address><Error Misc 0><Error Misc 1><Error Misc2><Error Misc 3> + + Where: + * Error Type: The hardwares cause the errors in format of two hex characters. + * SubType: Sub type of error in the specified hardware error in format of two hex characters. + * Instance: Combination of the socket, channel, slot cause the error in format of four hex characters. + * Error Status: Encode of error status in format of eight hex characters. + * Error Address: The address in device causes the errors in format of sixteen hex characters. + * Error Misc 0/1/2/3: Addition info about the errors. Each field is in format of sixteen hex characters. + + Example: + # cat error_other_ce + 0a020000000030e400000000000000800000020000000000000000000000000000000000000000000000000000000000 + + The size of the alert buffer for this error type is 8 alerts. + When the buffer is overflowed, the read to overflow_other_ce will return 1, otherwise it returns 0. + + Example: + # cat overflow_other_ce + 1 + +Below table defines the value of Error types, Sub Types, Sub component and instance: + + ============ ========== ========= =============== ================ + Error Group Error Type Sub type Sub component Instance + CPM 0 0 Snoop-Logic CPM # + CPM 0 2 Armv8 Core 1 CPM # + MCU 1 1 ERR1 MCU # | SLOT << 11 + MCU 1 2 ERR2 MCU # | SLOT << 11 + MCU 1 3 ERR3 MCU # + MCU 1 4 ERR4 MCU # + MCU 1 5 ERR5 MCU # + MCU 1 6 ERR6 MCU # + MCU 1 7 Link Error MCU # + Mesh 2 0 Cross Point X | (Y << 5) | NS <<11 + Mesh 2 1 Home Node(IO) X | (Y << 5) | NS <<11 + Mesh 2 2 Home Node(Mem) X | (Y << 5) | NS <<11 | device<<12 + Mesh 2 4 CCIX Node X | (Y << 5) | NS <<11 + 2P Link 3 0 N/A Altra 2P Link # + GIC 5 0 ERR0 0 + GIC 5 1 ERR1 0 + GIC 5 2 ERR2 0 + GIC 5 3 ERR3 0 + GIC 5 4 ERR4 0 + GIC 5 5 ERR5 0 + GIC 5 6 ERR6 0 + GIC 5 7 ERR7 0 + GIC 5 8 ERR8 0 + GIC 5 9 ERR9 0 + GIC 5 10 ERR10 0 + GIC 5 11 ERR11 0 + GIC 5 12 ERR12 0 + GIC 5 13-21 ERR13 RC# + 1 + SMMU 6 TCU 100 RC # + SMMU 6 TBU0 0 RC # + SMMU 6 TBU1 1 RC # + SMMU 6 TBU2 2 RC # + SMMU 6 TBU3 3 RC # + SMMU 6 TBU4 4 RC # + SMMU 6 TBU5 5 RC # + SMMU 6 TBU6 6 RC # + SMMU 6 TBU7 7 RC # + SMMU 6 TBU8 8 RC # + SMMU 6 TBU9 9 RC # + PCIe AER 7 Root 0 RC # + PCIe AER 7 Device 1 RC # + PCIe RC 8 RCA HB 0 RC # + PCIe RC 8 RCB HB 1 RC # + PCIe RC 8 RASDP 8 RC # + OCM 9 ERR0 0 0 + OCM 9 ERR1 1 0 + OCM 9 ERR2 2 0 + SMpro 10 ERR0 0 0 + SMpro 10 ERR1 1 0 + SMpro 10 MPA_ERR 2 0 + PMpro 11 ERR0 0 0 + PMpro 11 ERR1 1 0 + PMpro 11 MPA_ERR 2 0 + ============= ========== ========= =============== ================ + + +2) Type 2 for the Internal SMpro/PMpro alert types:: + + <Error Type><Error SubType><Direction><Error Location><Error Code><Error Data> + + Where: + * Error Type: SMpro/PMpro Error types in format of two hex characters. + + 1: Warning + + 2: Error + + 4: Error with data + * Error SubType: SMpro/PMpro Image Code in format of two hex characters. + * Direction: Direction in format of two hex characters. + + 0: Enter + + 1: Exit + * Error Location: SMpro/PMpro Module Location code in format of two hex characters. + * Error Code: SMpro/PMpro Error code in format of four hex characters. + * Error Data: Extensive datae in format of eight hex characters. + All bits are 0 when Error Type is warning or error. + + Example: + # cat errors_smpro + 01040108003500000000 + +3) Type 3 for the VRD hot, VRD /warn/fault, DIMM Hot event:: + + <Event Channel><Event Data> + + Where: + * Event channel: + 00: VRD Warning Fault + 01: VRD Hot + 02: DIMM hot + * Event Data: Extensive data if have in format of four hex characters. + + Example: + #cat event_vrd_hot + 010000 + +Sysfs entries +------------- + +The following sysfs files are supported: + +* Ampere(R) Altra(R): + +Alert Types: + + ================= =============== =========================================================== ======= + Alert Type Sysfs name Description Format + Core CE Errors errors_core_ce Triggered by CPU when Core has an CE error 1 + Core UE Errors errors_core_ue Triggered by CPU when Core has an UE error 1 + Memory CE Errors errors_mem_ce Triggered by CPU when Memory has an CE error 1 + Memory UE Errors errors_mem_ue Triggered by CPU when Memory has an UE error 1 + PCIe CE Errors errors_pcie_ce Triggered by CPU when any PCIe controller has any CE error 1 + PCIe UE Errors errors_pcie_ue Triggered by CPU when any PCIe controller has any UE error 1 + Other CE Errors errors_other_ce Triggered by CPU when any Others CE error 1 + Other UE Errors errors_other_ue Triggered by CPU when any Others UE error 1 + SMpro Errors errors_smpro Triggered by CPU when system have SMpro error 2 + PMpro Errors errors_pmpro Triggered by CPU when system have PMpro error 2 + ================= =============== =========================================================== ======= + +Event Type: + + ============================ ========================== =========== ======================== + Event Type Sysfs name Event Type Sub Type + VRD HOT event_vrd_hot 0 0: SoC, 1: Core, 2: DIMM + VR Warn/Fault event_vrd_warn_fault 1 0: SoC, 1: Core, 2: DIMM + DIMM Hot event_dimm_hot 2 NA (Default 0) + ============================ ========================== =========== ======================== -- 2.35.1