On Mon, Apr 30, 2018 at 04:33:52PM -0500, Alexandru Gagniuc wrote:
> The policy was to panic() when GHES said that an error is "Fatal".
> This logic is wrong for several reasons, as it doesn't take into
> account what caused the error.
>
> PCIe fatal errors indicate that the link to a device is either
> unstable or unusable. They don't indicate that the machine is on fire,
> and they are not severe enough that we need to panic(). Instead of
> relying on crackmonkey firmware, evaluate the error severity based on
             ^^^^^^^^^^^^

Please keep the smartass formulations for the ML only and do not let
them leak into commit messages.

> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@xxxxxxxxx>
> ---
>  drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index c9f1971333c1..49318fba409c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
>   * GHES_SEV_RECOVERABLE -> AER_NONFATAL
>   * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
>   *     These both need to be reported and recovered from by the AER driver.
> - * GHES_SEV_PANIC does not make it to this handling since the kernel must
> - *     panic.
> + * GHES_SEV_PANIC -> AER_FATAL
>   */
>  static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  {
> @@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>  #endif
>  }
>
> +/* PCIe errors should not cause a panic. */
> +static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> +	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))

How is PCIe error severity dependent on whether the AER error reporting
driver is enabled (and possibly not even loaded) on the system?

> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_cper_severity(gdata->error_severity);
> +}
> +/*

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.